comparison man/internals/internals.texi @ 2365:ce4aa0ef8af1

[xemacs-hg @ 2004-11-04 07:48:14 by ben] Major work on internals manual. Rearranged many chapters so as to lie in coherent divisions. Add tons of stuff to Future Work, Old Future Work, Discussions. Add lots of stuff to Mule section (Multilingual ...). Remove index.texi, incorporate into internals.texi. Section on early history and an introduction. Section on XEmacs split. Lots of new MS Windows docs. Most recently: Windows-I18N docs. Lots of new I18N docs. Loads of other stuff.
author ben
date Thu, 04 Nov 2004 07:48:14 +0000
parents 6aa56b089139
children 2d4dd2ef74e7
2364:28dea3be3c6c 2365:ce4aa0ef8af1
126 126
127 @ifinfo 127 @ifinfo
128 This Info file contains v21.5 of the XEmacs Internals Manual, October 2004. 128 This Info file contains v21.5 of the XEmacs Internals Manual, October 2004.
129 @end ifinfo 129 @end ifinfo
130 130
131 @c Don't update this by hand!!!!!! 131 @ignore
132 @c Use C-u C-c C-u m (aka C-u M-x texinfo-master-list). 132 Don't update this by hand!!!!!!
133 @c NOTE: This command does not include the Index:: menu entry. 133 Use C-u C-c C-u m (aka C-u M-x texinfo-master-list).
134 @c You must add it by hand. 134 NOTE: This command does not include the Index:: menu entry.
135 135 You must add it by hand.
136 @c Here are some useful Lisp routines for quickly Texinfo-izing text that 136
137 @c has been formatted into ASCII lists and tables. The first routine is 137 Here are some useful Lisp routines for quickly Texinfo-izing text that
138 @c currently more general and well-developed than the second. 138 has been formatted into ASCII lists and tables.
139 139
140 @c (defun list-to-texinfo (b e) 140 (defun list-to-texinfo (b e)
141 @c "Convert the selected region from an ASCII list to a Texinfo list." 141 "Convert the selected region from an ASCII list to a Texinfo list."
142 @c (interactive "r") 142 (interactive "r")
143 @c (save-restriction 143 (save-restriction
144 @c (narrow-to-region b e) 144 (narrow-to-region b e)
145 @c (goto-char (point-min)) 145 (goto-char (point-min))
146 @c (let ((dash-type "^ *-+ +") 146 (let ((dash-type "^ *-+ +")
147 @c (num-type "^ *[[(]?\\([0-9]+\\|[a-z]\\)[]).] +") 147 ;; allow single-letter numbering or roman numerals
148 @c dash) 148 (letter-type "^ *[[(]?\\([a-zA-Z]\\|[IVXivx]+\\)[]).] +")
149 @c (save-excursion 149 (num-type "^ *[[(]?[0-9]+[]).] +")
150 @c (cond ((re-search-forward num-type nil t)) 150 dash regexp)
151 @c ((re-search-forward dash-type nil t) (setq dash t)) 151 (save-excursion
152 @c (t (error "No table entries?")))) 152 (re-search-forward "\\s-*")
153 @c (if dash (insert "@itemize @bullet\n") 153 (cond ((looking-at dash-type) (setq regexp dash-type dash t))
154 @c (insert "@enumerate\n")) 154 ((looking-at letter-type) (setq regexp letter-type))
155 @c (while (re-search-forward (if dash dash-type num-type) nil t) 155 ((looking-at num-type) (setq regexp num-type))
156 @c (let ((p (point))) 156 ((re-search-forward num-type nil t) (setq regexp num-type))
157 @c (or (re-search-forward (if dash dash-type num-type) nil t) 157 ((re-search-forward letter-type nil t) (setq regexp letter-type))
158 @c (goto-char (point-max))) 158 ((re-search-forward dash-type nil t)
159 @c (beginning-of-line) 159 (setq regexp dash-type dash t))
160 @c (forward-line -1) 160 (t (error "No table entries?"))))
161 @c (let ((q (point))) 161 (if dash (insert "@itemize @bullet\n")
162 @c (goto-char p) 162 (insert "@enumerate\n"))
163 @c (kill-rectangle p q)) 163 (re-search-forward regexp nil 'limit)
164 @c (insert "@item\n"))) 164 (while (not (eobp))
165 @c (goto-char (point-max)) 165 (delete-region (point-at-bol) (point))
166 @c (beginning-of-line) 166 (insert "@item\n")
167 @c (if dash (insert "@end itemize\n") 167 ;; move forward over any text following the dash to not screw
168 @c (insert "@end enumerate\n"))))) 168 ;; up remove-spacing.
169 169 (forward-line 1)
170 @c (defun table-to-texinfo (b e) 170 (let ((p (point)))
171 @c "Convert the selected region from an ASCII table to a Texinfo table." 171 (or (re-search-forward regexp nil t)
172 @c (interactive "r") 172 (goto-char (point-max)))
173 @c (save-restriction 173 ;; trick to avoid using a marker
174 @c (narrow-to-region b e) 174 (save-excursion
175 @c (goto-char (point-min)) 175 ;; back up so as not to affect the line we're on (beginning of
176 @c (insert "@table @code\n") 176 ;; next entry)
177 @c (while (not (eobp)) 177 (forward-line -1)
178 @c (insert "@item ") 178 (remove-spacing p (point)))))
179 @c (forward-sexp) 179 (beginning-of-line)
180 @c (delete-char) 180 (if dash (insert "@end itemize\n")
181 @c (insert "\n") 181 (insert "@end enumerate\n")))))
182 @c (or (search-forward "\n\n" nil t) 182
183 @c (goto-char (point-max)))) 183 (defun remove-spacing (b e)
184 @c (beginning-of-line) 184 "Remove leading space from the selected region.
185 @c (insert "@end table\n"))) 185 This finds the maximum leading blank area common to all lines in the region.
186 186 This includes all lines any part of which are in the region."
187 @c A useful Lisp routine for adding markup based on conventions used in plain 187 (interactive "r")
188 @c text files; see doc string below. 188 (save-excursion
189 189 (let ((min 999999)
190 @c (defun convert-text-to-texinfo (&optional no-narrow) 190 seen)
191 @c "Convert text to Texinfo. 191 (goto-char e)
192 @c If the region is active, do the region; otherwise, go from point to the end 192 (end-of-line)
193 @c of the buffer. This query-replaces for various kinds of conventions used 193 (setq e (point))
194 @c in text: @code{} surrounded by ` and ' or followed by a (); @strong{} 194 (goto-char b)
195 @c surrounded by *'s; @file{} something that looks like a file name." 195 (beginning-of-line)
196 @c (interactive) 196 (setq b (point))
197 @c (if (region-active-p) 197 (while (< (point) e)
198 @c (save-restriction 198 (cond ((looking-at "^\\s-+")
199 @c (narrow-to-region (region-beginning) (region-end)) 199 (goto-char (match-end 0))
200 @c (convert-comments-to-texinfo t)) 200 (setq min (min min (current-column))
201 @c (let ((p (point)) 201 seen t))
202 @c (case-replace nil)) 202 ((looking-at "^\\s-*$"))
203 @c (query-replace-regexp "`\\([^']+\\)'\\([^']\\)" "@code{\\1}\\2" nil) 203 (t (setq min 0)))
204 @c (goto-char p) 204 (forward-line 1))
205 @c (query-replace-regexp "\\(\\Sw\\)\\*\\(\\(?:\\s_\\|\\sw\\)+\\)\\*\\([^A-Za-z.}]\\)" "\\1@strong{\\2}\\3" nil) 205 (when (and seen (> min 0))
206 @c (goto-char p) 206 (goto-char e)
207 @c (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+()\\)\\([^}]\\)" "@code{\\1}\\3" nil) 207 (untabify b e)
208 @c (goto-char p) 208 ;; we are at end of line already.
209 @c (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+\\.[A-Za-z]+\\)\\([^A-Za-z.}]\\)" "@file{\\1}\\3" nil) 209 (if (not (= (point) (point-at-eol)))
210 @c ))) 210 (error "Logic error"))
211 ;; Pad line with spaces if necessary (it may be just a blank line)
212 (if (< (current-column) min)
213 (insert-char ?\ (- min (current-column)))
214 (beginning-of-line)
215 (forward-char min))
216 (kill-rectangle b (point))))))
217
218 (defun table-to-texinfo (b e)
219 "Convert the selected region from an ASCII table to a Texinfo table.
220 Assumes entries are separated by a blank line, and the first sexp in
221 each entry is the table heading."
222 (interactive "r")
223 (save-restriction
224 (narrow-to-region b e)
225 (goto-char (point-min))
226 (insert "@table @code\n")
227 (while (not (eobp))
228 ;; remember where we want to insert the @item.
229 ;; delete the spacing first since inserting the @item may create
230 ;; a line with no spacing, if there is text following the heading on
231 ;; the same line.
232 (let ((beg (point)))
233 ;; removing the space and inserting the @item will change the
234 ;; position of the end of the region, so to make it easy on us
235 ;; leave point at end so it will be adjusted.
236 (forward-line 1)
237 (let ((beg2 (point)))
238 (or (re-search-forward "^$" nil t)
239 (goto-char (point-max)))
240 (backward-char 1)
241 (remove-spacing beg2 (point)))
242 (ignore-errors (forward-char 2))
243 (save-excursion
244 (goto-char beg)
245 (insert "@item ")
246 (forward-sexp)
247 (delete-char)
248 (insert "\n"))))
249 (beginning-of-line)
250 (insert "@end table\n")))
251
252 A useful Lisp routine for adding markup based on conventions used in plain
253 text files; see doc string below.
254
255 (defun convert-text-to-texinfo (&optional no-narrow)
256 "Convert text to Texinfo.
257 If the region is active, do the region; otherwise, go from point to the end
258 of the buffer. This query-replaces for various kinds of conventions used
259 in text: @code{} surrounded by ` and ' or followed by a (); @strong{}
260 surrounded by *'s; @file{} something that looks like a file name."
261 (interactive)
262 (if (region-active-p)
263 (save-restriction
264 (narrow-to-region (region-beginning) (region-end))
265 (convert-comments-to-texinfo t))
266 (let ((p (point))
267 (case-replace nil))
268 (query-replace-regexp "`\\([^']+\\)'\\([^']\\)" "@code{\\1}\\2" nil)
269 (goto-char p)
270 (query-replace-regexp "\\(\\Sw\\)\\*\\(\\(?:\\s_\\|\\sw\\)+\\)\\*\\([^A-Za-z.}]\\)" "\\1@strong{\\2}\\3" nil)
271 (goto-char p)
272 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+()\\)\\([^}]\\)" "@code{\\1}\\3" nil)
273 (goto-char p)
274 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+\\.[A-Za-z]+\\)\\([^A-Za-z.}]\\)" "@file{\\1}\\3" nil)
275 )))
276
277 Macro to generate the "Future Work" section from a title; put
278 point at beginning.
279
280 (defalias 'make-future (read-kbd-macro
281 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Future SPC Work SPC - - SPC <home> <down> <C-right> <right> Future SPC Work SPC - - SPC <end> RET @cindex SPC future SPC work, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC future SPC work RET"))
282
283 Similar but generates a "Discussion" section.
284
285 (defalias 'make-discussion (read-kbd-macro
286 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Discussion SPC - - SPC <home> <down> <C-right> <right> Discussion SPC - - SPC <end> RET @cindex SPC discussion, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC discussion RET"))
287
288 Similar but generates an "Old Future Work" section.
289
290 (defalias 'make-old-future (read-kbd-macro
291 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Old SPC Future SPC Work SPC - - SPC <home> <down> <C-right> <right> Old SPC Future SPC Work SPC - - SPC <end> RET @cindex SPC old SPC future SPC work, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC old SPC future SPC work RET"))
292
293 Similar but generates a general section.
294
295 (defalias 'make-section (read-kbd-macro
296 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> RET @cindex SPC C-SPC C-g <f4> C-x C-x M-l <home> <down>"))
297
298 Similar but generates a general subsection.
299
300 (defalias 'make-subsection (read-kbd-macro
301 "<S-end> <f3> <home> @node SPC <end> RET @subsection SPC <f4> RET @cindex SPC C-SPC C-g <f4> C-x C-x M-l <home> <down>"))
302 @end ignore
211 303
212 @menu 304 @menu
213 * Introduction:: Overview of this manual. 305 * Introduction:: Overview of this manual.
214 * Authorship of XEmacs:: 306 * Authorship of XEmacs::
215 * A History of Emacs:: Times, dates, important events. 307 * A History of Emacs:: Times, dates, important events.
216 * XEmacs From the Outside:: A broad conceptual overview. 308 * The XEmacs Split::
309 * XEmacs from the Outside:: A broad conceptual overview.
217 * The Lisp Language:: An overview. 310 * The Lisp Language:: An overview.
218 * XEmacs From the Perspective of Building:: 311 * XEmacs from the Perspective of Building::
219 * Build-Time Dependencies:: 312 * Build-Time Dependencies::
220 * XEmacs From the Inside:: 313 * The Modules of XEmacs::
221 * The XEmacs Object System (Abstractly Speaking)::
222 * How Lisp Objects Are Represented in C::
223 * Major Textual Changes:: 314 * Major Textual Changes::
224 * Rules When Writing New C Code:: 315 * Rules When Writing New C Code::
225 * Regression Testing XEmacs:: 316 * Regression Testing XEmacs::
226 * CVS Techniques:: 317 * CVS Techniques::
227 * The Modules of XEmacs:: 318 * XEmacs from the Inside::
319 * The XEmacs Object System (Abstractly Speaking)::
320 * How Lisp Objects Are Represented in C::
228 * Allocation of Objects in XEmacs Lisp:: 321 * Allocation of Objects in XEmacs Lisp::
229 * Dumping:: 322 * The Lisp Reader and Compiler::
230 * Events and the Event Loop::
231 * Asynchronous Events; Quit Checking::
232 * Evaluation; Stack Frames; Bindings:: 323 * Evaluation; Stack Frames; Bindings::
233 * Symbols and Variables:: 324 * Symbols and Variables::
234 * Buffers:: 325 * Buffers::
235 * Text:: 326 * Text::
236 * Multilingual Support:: 327 * Multilingual Support::
237 * The Lisp Reader and Compiler::
238 * Lstreams::
239 * Consoles; Devices; Frames; Windows:: 328 * Consoles; Devices; Frames; Windows::
240 * The Redisplay Mechanism:: 329 * The Redisplay Mechanism::
241 * Extents:: 330 * Extents::
242 * Faces:: 331 * Faces::
243 * Glyphs:: 332 * Glyphs::
244 * Specifiers:: 333 * Specifiers::
245 * Menus:: 334 * Menus::
335 * Events and the Event Loop::
336 * Asynchronous Events; Quit Checking::
337 * Lstreams::
246 * Subprocesses:: 338 * Subprocesses::
247 * Interface to MS Windows:: 339 * Interface to MS Windows::
248 * Interface to the X Window System:: 340 * Interface to the X Window System::
341 * Dumping::
249 * Future Work:: 342 * Future Work::
250 * Future Work Discussion:: 343 * Future Work Discussion::
251 * Old Future Work:: 344 * Old Future Work::
252 * Index:: 345 * Index::
253 346
259 * Through Version 18:: Unification prevails. 352 * Through Version 18:: Unification prevails.
260 * Lucid Emacs:: One version 19 Emacs. 353 * Lucid Emacs:: One version 19 Emacs.
261 * GNU Emacs 19:: The other version 19 Emacs. 354 * GNU Emacs 19:: The other version 19 Emacs.
262 * GNU Emacs 20:: The other version 20 Emacs. 355 * GNU Emacs 20:: The other version 20 Emacs.
263 * XEmacs:: The continuation of Lucid Emacs. 356 * XEmacs:: The continuation of Lucid Emacs.
357
358 The Modules of XEmacs
359
360 * A Summary of the Various XEmacs Modules::
361 * Low-Level Modules::
362 * Basic Lisp Modules::
363 * Modules for Standard Editing Operations::
364 * Modules for Interfacing with the File System::
365 * Modules for Other Aspects of the Lisp Interpreter and Object System::
366 * Modules for Interfacing with the Operating System::
264 367
265 Major Textual Changes 368 Major Textual Changes
266 369
267 * Great Integral Type Renaming:: 370 * Great Integral Type Renaming::
268 * Text/Char Type Renaming:: 371 * Text/Char Type Renaming::
285 * Modules for Regression Testing:: 388 * Modules for Regression Testing::
286 389
287 CVS Techniques 390 CVS Techniques
288 391
289 * Merging a Branch into the Trunk:: 392 * Merging a Branch into the Trunk::
290
291 The Modules of XEmacs
292
293 * A Summary of the Various XEmacs Modules::
294 * Low-Level Modules::
295 * Basic Lisp Modules::
296 * Modules for Standard Editing Operations::
297 * Modules for Interfacing with the File System::
298 * Modules for Other Aspects of the Lisp Interpreter and Object System::
299 * Modules for Interfacing with the Operating System::
300 393
301 Allocation of Objects in XEmacs Lisp 394 Allocation of Objects in XEmacs Lisp
302 395
303 * Introduction to Allocation:: 396 * Introduction to Allocation::
304 * Garbage Collection:: 397 * Garbage Collection::
325 * sweep_lcrecords_1:: 418 * sweep_lcrecords_1::
326 * compact_string_chars:: 419 * compact_string_chars::
327 * sweep_strings:: 420 * sweep_strings::
328 * sweep_bit_vectors_1:: 421 * sweep_bit_vectors_1::
329 422
330 Dumping 423 Evaluation; Stack Frames; Bindings
331 424
332 * Dumping Justification:: 425 * Evaluation::
333 * Overview:: 426 * Dynamic Binding; The specbinding Stack; Unwind-Protects::
334 * Data descriptions:: 427 * Simple Special Forms::
335 * Dumping phase:: 428 * Catch and Throw::
336 * Reloading phase:: 429 * Error Trapping::
337 * Remaining issues:: 430
338 431 Symbols and Variables
339 Dumping phase 432
340 433 * Introduction to Symbols::
341 * Object inventory:: 434 * Obarrays::
342 * Address allocation:: 435 * Symbol Values::
343 * The header:: 436
344 * Data dumping:: 437 Buffers
345 * Pointers dumping:: 438
439 * Introduction to Buffers:: A buffer holds a block of text such as a file.
440 * Buffer Lists:: Keeping track of all buffers.
441 * Markers and Extents:: Tagging locations within a buffer.
442 * The Buffer Object:: The Lisp object corresponding to a buffer.
443
444 Text
445
446 * The Text in a Buffer:: Representation of the text in a buffer.
447 * Ibytes and Ichars:: Representation of individual characters.
448 * Byte-Char Position Conversion::
449 * Searching and Matching:: Higher-level algorithms.
450
451 Multilingual Support
452
453 * Introduction to Multilingual Issues #1::
454 * Introduction to Multilingual Issues #2::
455 * Introduction to Multilingual Issues #3::
456 * Introduction to Multilingual Issues #4::
457 * Character Sets::
458 * Encodings::
459 * Internal Mule Encodings::
460 * Byte/Character Types; Buffer Positions; Other Typedefs::
461 * Internal Text API's::
462 * Coding for Mule::
463 * CCL::
464 * Microsoft Windows-Related Multilingual Issues::
465 * Modules for Internationalization::
466
467 Encodings
468
469 * Japanese EUC (Extended Unix Code)::
470 * JIS7::
471
472 Internal Mule Encodings
473
474 * Internal String Encoding::
475 * Internal Character Encoding::
476
477 Byte/Character Types; Buffer Positions; Other Typedefs
478
479 * Byte Types::
480 * Different Ways of Seeing Internal Text::
481 * Buffer Positions::
482 * Other Typedefs::
483 * Usage of the Various Representations::
484 * Working With the Various Representations::
485
486 Internal Text API's
487
488 * Basic internal-format API's::
489 * The DFC API::
490 * The Eistring API::
491
492 Coding for Mule
493
494 * Character-Related Data Types::
495 * Working With Character and Byte Positions::
496 * Conversion to and from External Data::
497 * General Guidelines for Writing Mule-Aware Code::
498 * An Example of Mule-Aware Code::
499 * Mule-izing Code::
500
501 Microsoft Windows-Related Multilingual Issues
502
503 * Microsoft Documentation::
504 * Locales::
505 * More about code pages::
506 * More about locales::
507 * Unicode support under Windows::
508 * The golden rules of writing Unicode-safe code::
509 * The format of the locale in setlocale()::
510 * Random other Windows I18N docs::
511
512 Consoles; Devices; Frames; Windows
513
514 * Introduction to Consoles; Devices; Frames; Windows::
515 * Point::
516 * Window Hierarchy::
517 * The Window Object::
518 * Modules for the Basic Displayable Lisp Objects::
519
520 The Redisplay Mechanism
521
522 * Critical Redisplay Sections::
523 * Line Start Cache::
524 * Redisplay Piece by Piece::
525 * Modules for the Redisplay Mechanism::
526 * Modules for other Display-Related Lisp Objects::
527
528 Extents
529
530 * Introduction to Extents:: Extents are ranges over text, with properties.
531 * Extent Ordering:: How extents are ordered internally.
532 * Format of the Extent Info:: The extent information in a buffer or string.
533 * Zero-Length Extents:: A weird special case.
534 * Mathematics of Extent Ordering:: A rigorous foundation.
535 * Extent Fragments:: Cached information useful for redisplay.
346 536
347 Events and the Event Loop 537 Events and the Event Loop
348 538
349 * Introduction to Events:: 539 * Introduction to Events::
350 * Main Loop:: 540 * Main Loop::
365 * Control-G (Quit) Checking:: 555 * Control-G (Quit) Checking::
366 * Profiling:: 556 * Profiling::
367 * Asynchronous Timeouts:: 557 * Asynchronous Timeouts::
368 * Exiting:: 558 * Exiting::
369 559
370 Evaluation; Stack Frames; Bindings
371
372 * Evaluation::
373 * Dynamic Binding; The specbinding Stack; Unwind-Protects::
374 * Simple Special Forms::
375 * Catch and Throw::
376
377 Symbols and Variables
378
379 * Introduction to Symbols::
380 * Obarrays::
381 * Symbol Values::
382
383 Buffers
384
385 * Introduction to Buffers:: A buffer holds a block of text such as a file.
386 * Buffer Lists:: Keeping track of all buffers.
387 * Markers and Extents:: Tagging locations within a buffer.
388 * The Buffer Object:: The Lisp object corresponding to a buffer.
389
390 Text
391
392 * The Text in a Buffer:: Representation of the text in a buffer.
393 * Ibytes and Ichars:: Representation of individual characters.
394 * Byte-Char Position Conversion::
395 * Searching and Matching:: Higher-level algorithms.
396
397 Multilingual Support
398
399 * Introduction to Multilingual Issues #1::
400 * Introduction to Multilingual Issues #2::
401 * Introduction to Multilingual Issues #3::
402 * Introduction to Multilingual Issues #4::
403 * Character Sets::
404 * Encodings::
405 * Internal Mule Encodings::
406 * Byte/Character Types; Buffer Positions; Other Typedefs::
407 * Internal Text API's::
408 * Coding for Mule::
409 * CCL::
410 * Modules for Internationalization::
411
412 Encodings
413
414 * Japanese EUC (Extended Unix Code)::
415 * JIS7::
416
417 Internal Mule Encodings
418
419 * Internal String Encoding::
420 * Internal Character Encoding::
421
422 Byte/Character Types; Buffer Positions; Other Typedefs
423
424 * Byte Types::
425 * Different Ways of Seeing Internal Text::
426 * Buffer Positions::
427 * Other Typedefs::
428 * Usage of the Various Representations::
429 * Working With the Various Representations::
430
431 Internal Text API's
432
433 * Basic internal-format API's::
434 * The DFC API::
435 * The Eistring API::
436
437 Coding for Mule
438
439 * Character-Related Data Types::
440 * Working With Character and Byte Positions::
441 * Conversion to and from External Data::
442 * General Guidelines for Writing Mule-Aware Code::
443 * An Example of Mule-Aware Code::
444 * Mule-izing Code::
445
446 Lstreams 560 Lstreams
447 561
448 * Creating an Lstream:: Creating an lstream object. 562 * Creating an Lstream:: Creating an lstream object.
449 * Lstream Types:: Different sorts of things that are streamed. 563 * Lstream Types:: Different sorts of things that are streamed.
450 * Lstream Functions:: Functions for working with lstreams. 564 * Lstream Functions:: Functions for working with lstreams.
451 * Lstream Methods:: Creating new lstream types. 565 * Lstream Methods:: Creating new lstream types.
452
453 Consoles; Devices; Frames; Windows
454
455 * Introduction to Consoles; Devices; Frames; Windows::
456 * Point::
457 * Window Hierarchy::
458 * The Window Object::
459 * Modules for the Basic Displayable Lisp Objects::
460
461 The Redisplay Mechanism
462
463 * Critical Redisplay Sections::
464 * Line Start Cache::
465 * Redisplay Piece by Piece::
466 * Modules for the Redisplay Mechanism::
467 * Modules for other Display-Related Lisp Objects::
468
469 Extents
470
471 * Introduction to Extents:: Extents are ranges over text, with properties.
472 * Extent Ordering:: How extents are ordered internally.
473 * Format of the Extent Info:: The extent information in a buffer or string.
474 * Zero-Length Extents:: A weird special case.
475 * Mathematics of Extent Ordering:: A rigorous foundation.
476 * Extent Fragments:: Cached information useful for redisplay.
477 566
478 Interface to MS Windows 567 Interface to MS Windows
479 568
480 * Different kinds of Windows environments:: 569 * Different kinds of Windows environments::
481 * Windows Build Flags:: 570 * Windows Build Flags::
494 * Menubars:: 583 * Menubars::
495 * Checkboxes and Radio Buttons:: 584 * Checkboxes and Radio Buttons::
496 * Progress Bars:: 585 * Progress Bars::
497 * Tab Controls:: 586 * Tab Controls::
498 587
588 Dumping
589
590 * Dumping Justification::
591 * Overview::
592 * Data descriptions::
593 * Dumping phase::
594 * Reloading phase::
595 * Remaining issues::
596
597 Dumping phase
598
599 * Object inventory::
600 * Address allocation::
601 * The header::
602 * Data dumping::
603 * Pointers dumping::
604
499 Future Work 605 Future Work
500 606
607 * Future Work -- General Suggestions::
501 * Future Work -- Elisp Compatibility Package:: 608 * Future Work -- Elisp Compatibility Package::
502 * Future Work -- Drag-n-Drop:: 609 * Future Work -- Drag-n-Drop::
503 * Future Work -- Standard Interface for Enabling Extensions:: 610 * Future Work -- Standard Interface for Enabling Extensions::
504 * Future Work -- Better Initialization File Scheme:: 611 * Future Work -- Better Initialization File Scheme::
505 * Future Work -- Keyword Parameters:: 612 * Future Work -- Keyword Parameters::
543 650
544 Future Work -- Byte Code Snippets 651 Future Work -- Byte Code Snippets
545 652
546 * Future Work -- Autodetection:: 653 * Future Work -- Autodetection::
547 * Future Work -- Conversion Error Detection:: 654 * Future Work -- Conversion Error Detection::
655 * Future Work -- Unicode::
548 * Future Work -- BIDI Support:: 656 * Future Work -- BIDI Support::
549 * Future Work -- Localized Text/Messages:: 657 * Future Work -- Localized Text/Messages::
550 658
551 Future Work -- Lisp Engine Replacement 659 Future Work -- Lisp Engine Replacement
552 660
553 * Future Work -- Lisp Engine Discussion:: 661 * Future Work -- Lisp Engine Discussion::
554 * Future Work -- Lisp Engine Replacement -- Implementation:: 662 * Future Work -- Lisp Engine Replacement -- Implementation::
663 * Future Work -- Startup File Modification by Packages::
555 664
556 Future Work Discussion 665 Future Work Discussion
557 666
558 * Discussion -- garbage collection:: 667 * Discussion -- garbage collection::
559 * Discussion -- glyphs:: 668 * Discussion -- glyphs::
669 * Discussion -- Dialog Boxes::
670 * Discussion -- Multilingual Issues::
671 * Discussion -- Windows External Widget::
672 * Discussion -- Packages::
673 * Discussion -- Distribution Layout::
560 674
561 Old Future Work 675 Old Future Work
562 676
563 * Future Work -- A Portable Unexec Replacement:: 677 * Old Future Work -- A Portable Unexec Replacement::
564 * Future Work -- Indirect Buffers:: 678 * Old Future Work -- Indirect Buffers::
565 * Future Work -- Improvements in support for non-ASCII (European) keysyms under X:: 679 * Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X::
566 * Future Work -- xemacs.org Mailing Address Changes:: 680 * Old Future Work -- RTF Clipboard Support::
567 * Future Work -- Lisp callbacks from critical areas of the C code:: 681 * Old Future Work -- xemacs.org Mailing Address Changes::
682 * Old Future Work -- Lisp callbacks from critical areas of the C code::
568 683
569 @end detailmenu 684 @end detailmenu
570 @end menu 685 @end menu
571 686
572 @node Introduction, Authorship of XEmacs, Top, Top 687 @node Introduction, Authorship of XEmacs, Top, Top
595 the snapshot of the code you are looking at, and in the case of 710 the snapshot of the code you are looking at, and in the case of
596 contradictions between the code comments and the manual, @strong{always} 711 contradictions between the code comments and the manual, @strong{always}
597 assume that the code comments are correct. (Because of the proximity of 712 assume that the code comments are correct. (Because of the proximity of
598 the comments to the code, comments will rarely be out-of-date.) 713 the comments to the code, comments will rarely be out-of-date.)
599 714
715 The manual is organized in chapters which are broadly grouped into major
716 divisions:
717
718 @enumerate
719 @item
720 First is the introduction, including this chapter and chapters on the
721 history and authorship of XEmacs.
722 @item
723 Next, starting with @ref{XEmacs from the Outside}, are some chapters
724 giving a broad overview of the internal workings of XEmacs and
725 documenting important information relevant to those working on the code.
726 @item
727 The remaining divisions document the nitty-gritty details of the
728 internal workings. First, starting with @ref{XEmacs from the Inside},
729 is a division on the workings of the Lisp interpreter that drives
730 XEmacs.
731 @item
732 Next, starting with @ref{Buffers}, is a division on the parts of the
733 code specifically devoted to text processing, including multilingual
734 support (Mule).
735 @item
736 Afterwards, starting with @ref{Consoles; Devices; Frames; Windows}, is a
737 division covering the display mechanism and the objects and modules
738 relevant to this.
739 @item
740 Then, starting with @ref{Events and the Event Loop}, is a division
741 covering the interface between XEmacs and the outside world, including
742 user interactions, subprocesses, file I/O, interfaces to particular
743 windowing systems, and dumping.
744 @item
745 Finally, starting with @ref{Future Work}, is a division containing
746 proposals and discussion relating to future work on XEmacs.
747 @end enumerate
748
600 This manual was primarily written by Ben Wing. Certain sections were 749 This manual was primarily written by Ben Wing. Certain sections were
601 written by others, including those mentioned on the title page as well 750 written by others, including those mentioned on the title page as well
602 as other coders. Some sections were lifted directly from comments in 751 as other coders. Some sections were lifted directly from comments in
603 the code, and in those cases we may not completely be aware of the 752 the code, and in those cases we may not completely be aware of the
604 authorship. In addition, due to the collaborative nature of XEmacs, 753 authorship. In addition, due to the collaborative nature of XEmacs,
613 @table @asis 762 @table @asis
614 @item Stephen Turnbull 763 @item Stephen Turnbull
615 Various cleanup work, mostly post-2000. Object-Oriented Techniques in 764 Various cleanup work, mostly post-2000. Object-Oriented Techniques in
616 XEmacs. A Reader's Guide to XEmacs Coding Conventions. Searching and 765 XEmacs. A Reader's Guide to XEmacs Coding Conventions. Searching and
617 Matching. Regression Testing XEmacs. Modules for Regression Testing. 766 Matching. Regression Testing XEmacs. Modules for Regression Testing.
618 Lucid Widget Library. 767 Lucid Widget Library. A number of sections in the Future Work chapter.
619 @item Martin Buchholz 768 @item Martin Buchholz
620 Various cleanup work, mostly pre-2001. Docs on inline functions. Docs 769 Various cleanup work, mostly pre-2001. Docs on inline functions. Docs
621 on dfc conversion functions (Conversion to and from External Data). 770 on dfc conversion functions (Conversion to and from External Data).
622 Improvements in support for non-ASCII (European) keysyms under X. 771 Improvements in support for non-ASCII (European) keysyms under X.
772 A section or two in the Future Work chapter.
623 @item Hrvoje Niksic 773 @item Hrvoje Niksic
624 Coding for Mule. 774 Coding for Mule.
625 @item Matthias Neubauer 775 @item Matthias Neubauer
626 Garbage Collection - Step by Step. 776 Garbage Collection - Step by Step.
627 @item Olivier Galibert 777 @item Olivier Galibert
630 Redisplay Piece by Piece. Glyphs. 780 Redisplay Piece by Piece. Glyphs.
631 @item Chuck Thompson 781 @item Chuck Thompson
632 Line Start Cache. 782 Line Start Cache.
633 @item Kenichi Handa 783 @item Kenichi Handa
634 CCL. 784 CCL.
785 @item Jamie Zawinski
786 A couple of sections in the Future Work chapter.
635 @end table 787 @end table
636 788
637 @node Authorship of XEmacs, A History of Emacs, Introduction, Top 789 @node Authorship of XEmacs, A History of Emacs, Introduction, Top
638 @chapter Authorship of XEmacs 790 @chapter Authorship of XEmacs
639 @cindex authorship, XEmacs 791 @cindex authorship, XEmacs
772 @item alloca.s 924 @item alloca.s
773 Inherited almost unchanged from FSF; kept in sync up through 19.30; 925 Inherited almost unchanged from FSF; kept in sync up through 19.30;
774 basically no changes for XEmacs. 926 basically no changes for XEmacs.
775 @end table 927 @end table
776 928
777 @node A History of Emacs, XEmacs From the Outside, Authorship of XEmacs, Top 929 @node A History of Emacs, The XEmacs Split, Authorship of XEmacs, Top
778 @chapter A History of Emacs 930 @chapter A History of Emacs
779 @cindex history of Emacs, a 931 @cindex history of Emacs, a
780 @cindex Emacs, a history of 932 @cindex Emacs, a history of
781 @cindex Hackers (Steven Levy) 933 @cindex Hackers (Steven Levy)
782 @cindex Levy, Steven 934 @cindex Levy, Steven
1342 version 21.2.45 released February 23, 2001. 1494 version 21.2.45 released February 23, 2001.
1343 @item 1495 @item
1344 version 21.2.46 released March 21, 2001. 1496 version 21.2.46 released March 21, 2001.
1345 @end itemize 1497 @end itemize
1346 1498
1347 @node XEmacs From the Outside, The Lisp Language, A History of Emacs, Top 1499 @node The XEmacs Split, XEmacs from the Outside, A History of Emacs, Top
1348 @chapter XEmacs From the Outside 1500 @chapter The XEmacs Split
1501 @cindex XEmacs split
1502
1503 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
1504
1505 @strong{NOTE NOTE NOTE}: The following is a @strong{highly} opinionated
1506 piece written by one of the main authors of XEmacs. This reflects his
1507 opinions, and his only! It is included here because it may help to
1508 clarify some of the issues that are keeping the two versions of Emacs
1509 separate.
1510
1511 Many people look at the split between GNU Emacs and XEmacs and are
1512 convinced that the XEmacs team is being needlessly divisive and just needs
1513 to cooperate a bit with RMS, and the two versions of Emacs will merge. In
1514 fact there have been six to seven major attempts at merging, each running
1515 hundreds of messages long and all of them coming from the XEmacs side. All
1516 have failed because they have eventually come to the same conclusion, which
1517 is that RMS has no real interest in cooperation at all. If you work with
1518 him, you have to do it his way -- "my way or the highway". Specifically:
1519
1520 @enumerate
1521 @item
1522
1523 RMS insists on having legal papers signed for every bit of code that goes
1524 into GNU Emacs. RMS's lawyers have told him that every contribution over
1525 ten lines long requires legal papers. These papers cannot be filled out
1526 over the web but must be done in person and mailed to the FSF.
1527 Obviously this by itself has a tendency to inhibit contributions because of
1528 the hassle factor. Furthermore, many people (and especially organizations)
1529 are either hesitant to or refuse to sign legal papers, for reasons
1530 mentioned below. For these reasons, XEmacs has never required signed legal
1531 papers for the code in it. Such papers are not a part of the GPL and
1532 are not required by any projects other than those of the FSF (for example,
1533 Linux does not require such papers). Since we do not know exactly who is
1534 the author of every bit of code that has been contributed to XEmacs in the
1535 last nine years, we would essentially have to rewrite large sections of the
1536 code. The situation, however, is worse than that because many of the large
1537 copyright holders of XEmacs (for example Sun Microsystems) refuse to sign
1538 legal papers. Although they have not stated their reasons, there are quite
1539 a number of reasons not to sign legal papers:
1540
1541 @itemize @bullet
1542 @item
1543 By doing so you essentially give up all control over your code. You can
1544 no longer release your code under a different license. If you want to
1545 use your code that you've contributed to the FSF in a project of your
1546 own, and that project is not released under the GPL, you are not allowed
1547 to do this. Obviously, large companies tend to want to reuse their code
1548 in many different projects and as a result feel very uncomfortable about
1549 signing legal papers.
1550 @item
1551 One of the dangers of assigning copyright to the FSF is that if the FSF
1552 happens to be taken over by some evil corporate entity or anyone with
1553 different ideas than RMS, they will own all copyright-assigned code, and
1554 can revoke the GPL and enforce any license they please. If the code has
1555 many different copyright holders, this is a much less likely
1556 scenario.
1557 @end itemize
1558
1559 @item
1560 RMS does not like abstract data structures. Abstract data structures are
1561 the foundation of XEmacs and most other modern programming projects. In
1562 my opinion, it is difficult to impossible to write maintainable and
1563 expandable code without using abstract data structures. In merging talks
1564 with RMS he has said we can have any abstract data structures we want in
1565 a merged version but must allow direct access to the implementation as
1566 well, which defeats the primary purpose of having abstract data
1567 structures.
1568
1569 @item
1570 RMS is very unwilling to compromise when it comes to divergent
1571 implementations of the same functionality, which is very common between
1572 XEmacs and GNU Emacs. Rather than taking the better interface on
1573 technical grounds, RMS insists that both interfaces must be implemented
1574 in C at the same level (rather than implementing one in C and the other
1575 on top of it), so that code that uses either interface is just as
1576 fast. This means that the resulting merged Emacs would be filled with a
1577 lot of very complicated code to simultaneously support two divergent
1578 interfaces, and would be difficult to maintain in this state.
1579
1580 @item
1581 RMS's idea of compromise and cooperation is almost purely political
1582 rather than technical. The XEmacs maintainers would like to have issues
1583 resolved by examining them technically and deciding what makes the most
1584 sense from a technical perspective. RMS, however, wants to proceed on a
1585 tit for tat kind of basis, which is to say, ``If we support this feature
1586 of yours, we also get to support this other feature of mine.'' The
1587 result of such a process is typically a big mess, because there is no
1588 overarching design but instead a great deal of incompatible things
1589 hodgepodged together.
1590 @end enumerate
1591
1592 If only some of the above differences were firmly held by RMS, and if he
1593 were willing to compromise effectively on the others and to demonstrate
1594 willingness to work with us on the issues that he is less willing to
1595 compromise on, we might go ahead with the merge despite misgivings. However,
1596 RMS has shown no real interest at all in compromising. He has never stated
1597 how all of the redundant work that would be required to support his
1598 preconditions would get done. It's unlikely that he would do it all and
1599 it's certainly not clear that the XEmacs project would be willing to do it
1600 all, given that it is a tremendous amount of extra work and the XEmacs
1601 project is already strapped for coding resources. (Not to mention the
1602 inherent difficulty in convincing people to redo existing work for
1603 primarily political reasons.) In general the free software community is
1604 quite strapped as a whole for coding resources; duplicative efforts accomplish
1605 very little and have a lot of negative effects in that they
1606 take away what few resources we do have from projects that would actually
1607 be useful.
1608
1609 RMS, however, does not seem to be bothered by this. He is more interested in
1610 sticking firm to his principles, though the heavens may fall down, than in
1611 working forward to create genuinely useful software. It is abundantly clear
1612 that RMS has no real interest in unity except if it happens to be on his
1613 own terms and allows him ultimate control over the result. He would rather
1614 see nothing happen at all than something that is not exactly according to
1615 his principles. The fact that few if any people share his principles is
1616 meaningless to him.
1617
1618 @node XEmacs from the Outside, The Lisp Language, The XEmacs Split, Top
1619 @chapter XEmacs from the Outside
1349 @cindex XEmacs from the outside 1620 @cindex XEmacs from the outside
1350 @cindex outside, XEmacs from the 1621 @cindex outside, XEmacs from the
1351 @cindex read-eval-print 1622 @cindex read-eval-print
1352 1623
1353 XEmacs appears to the outside world as an editor, but it is really a 1624 XEmacs appears to the outside world as an editor, but it is really a
1386 @cindex pi, calculating 1657 @cindex pi, calculating
1387 Note that you do not have to use XEmacs as an editor; you could just 1658 Note that you do not have to use XEmacs as an editor; you could just
1388 as well make it do your taxes, compute pi, play bridge, etc. You'd just 1659 as well make it do your taxes, compute pi, play bridge, etc. You'd just
1389 have to write functions to do those operations in Lisp. 1660 have to write functions to do those operations in Lisp.
1390 1661
1391 @node The Lisp Language, XEmacs From the Perspective of Building, XEmacs From the Outside, Top 1662 @node The Lisp Language, XEmacs from the Perspective of Building, XEmacs from the Outside, Top
1392 @chapter The Lisp Language 1663 @chapter The Lisp Language
1393 @cindex Lisp language, the 1664 @cindex Lisp language, the
1394 @cindex Lisp vs. C 1665 @cindex Lisp vs. C
1395 @cindex C vs. Lisp 1666 @cindex C vs. Lisp
1396 @cindex Lisp vs. Java 1667 @cindex Lisp vs. Java
1608 The word @dfn{application} in the previous paragraph was used 1879 The word @dfn{application} in the previous paragraph was used
1609 intentionally. XEmacs implements an API for programs written in Lisp 1880 intentionally. XEmacs implements an API for programs written in Lisp
1610 that makes it a full-fledged application platform, very much like an OS 1881 that makes it a full-fledged application platform, very much like an OS
1611 inside the real OS. 1882 inside the real OS.
1612 1883
1613 @node XEmacs From the Perspective of Building, Build-Time Dependencies, The Lisp Language, Top 1884 @node XEmacs from the Perspective of Building, Build-Time Dependencies, The Lisp Language, Top
1614 @chapter XEmacs From the Perspective of Building 1885 @chapter XEmacs from the Perspective of Building
1615 @cindex XEmacs from the perspective of building 1886 @cindex XEmacs from the perspective of building
1616 @cindex building, XEmacs from the perspective of 1887 @cindex building, XEmacs from the perspective of
1617 1888
1618 The heart of XEmacs is the Lisp environment, which is written in C. 1889 The heart of XEmacs is the Lisp environment, which is written in C.
1619 This is contained in the @file{src/} subdirectory. Underneath 1890 This is contained in the @file{src/} subdirectory. Underneath
1719 This is useful when the dumping procedure described above is broken, or 1990 This is useful when the dumping procedure described above is broken, or
1720 when using certain program debugging tools such as Purify. These tools 1991 when using certain program debugging tools such as Purify. These tools
1721 get mighty confused by the tricks played by the XEmacs build process, 1992 get mighty confused by the tricks played by the XEmacs build process,
1722 such as allocating memory in one process, and freeing it in the next. 1993 such as allocating memory in one process, and freeing it in the next.
1723 1994
1724 @node Build-Time Dependencies, XEmacs From the Inside, XEmacs From the Perspective of Building, Top 1995 @node Build-Time Dependencies, The Modules of XEmacs, XEmacs from the Perspective of Building, Top
1725 @chapter Build-Time Dependencies 1996 @chapter Build-Time Dependencies
1726 @cindex build-time dependencies 1997 @cindex build-time dependencies
1727 @cindex dependencies, build-time 1998 @cindex dependencies, build-time
1728 1999
1729 This is a collection of random notes on build-time dependencies as of 2000 This is a collection of random notes on build-time dependencies as of
1783 use any higher-level functionality that might load @file{custom.el}, but 2054 use any higher-level functionality that might load @file{custom.el}, but
1784 you do not need @file{subr.el}, you should @samp{defvar} 2055 you do not need @file{subr.el}, you should @samp{defvar}
1785 @code{custom-declare-variable-list} to prevent the @samp{void-variable} 2056 @code{custom-declare-variable-list} to prevent the @samp{void-variable}
1786 error. (Currently this is only needed for @file{make-docfile.el}.) 2057 error. (Currently this is only needed for @file{make-docfile.el}.)
1787 2058
1788 @node XEmacs From the Inside, The XEmacs Object System (Abstractly Speaking), Build-Time Dependencies, Top 2059 @node The Modules of XEmacs, Major Textual Changes, Build-Time Dependencies, Top
1789 @chapter XEmacs From the Inside
1790 @cindex XEmacs from the inside
1791 @cindex inside, XEmacs from the
1792
1793 Internally, XEmacs is quite complex, and can be very confusing. To
1794 simplify things, it can be useful to think of XEmacs as containing an
1795 event loop that ``drives'' everything, and a number of other subsystems,
1796 such as a Lisp engine and a redisplay mechanism. Each of these other
1797 subsystems exists simultaneously in XEmacs, and each has a certain
1798 state. The flow of control continually passes in and out of these
1799 different subsystems in the course of normal operation of the editor.
1800
1801 It is important to keep in mind that, most of the time, the editor is
1802 ``driven'' by the event loop. Except during initialization and batch
1803 mode, all subsystems are entered directly or indirectly through the
1804 event loop, and ultimately, control exits out of all subsystems back up
1805 to the event loop. This cycle of entering a subsystem, exiting back out
1806 to the event loop, and starting another iteration of the event loop
1807 occurs once each keystroke, mouse motion, etc.
1808
1809 If you're trying to understand a particular subsystem (other than the
1810 event loop), think of it as a ``daemon'' process or ``servant'' that is
1811 responsible for one particular aspect of a larger system, and
1812 periodically receives commands or environment changes that cause it to
1813 do something. Ultimately, these commands and environment changes are
1814 always triggered by the event loop. For example:
1815
1816 @itemize @bullet
1817 @item
1818 The window and frame mechanism is responsible for keeping track of what
1819 windows and frames exist, what buffers are in them, etc. It is
1820 periodically given commands (usually from the user) to make a change to
1821 the current window/frame state: i.e. create a new frame, delete a
1822 window, etc.
1823
1824 @item
1825 The buffer mechanism is responsible for keeping track of what buffers
1826 exist and what text is in them. It is periodically given commands
1827 (usually from the user) to insert or delete text, create a buffer, etc.
1828 When it receives a text-change command, it notifies the redisplay
1829 mechanism.
1830
1831 @item
1832 The redisplay mechanism is responsible for making sure that windows and
1833 frames are displayed correctly. It is periodically told (by the event
1834 loop) to actually ``do its job'', i.e. snoop around and see what the
1835 current state of the environment (mostly of the currently-existing
1836 windows, frames, and buffers) is, and make sure that state matches
1837 what's actually displayed. It keeps lots and lots of information around
1838 (such as what is actually being displayed currently, and what the
1839 environment was last time it checked) so that it can minimize the work
1840 it has to do. It is also helped along in that whenever a relevant
1841 change to the environment occurs, the redisplay mechanism is told about
1842 this, so it has a pretty good idea of where it has to look to find
1843 possible changes and doesn't have to look everywhere.
1844
1845 @item
1846 The Lisp engine is responsible for executing the Lisp code in which most
1847 user commands are written. It is entered through a call to @code{eval}
1848 or @code{funcall}, which occurs as a result of dispatching an event from
1849 the event loop. The functions it calls issue commands to the buffer
1850 mechanism, the window/frame subsystem, etc.
1851
1852 @item
1853 The Lisp allocation subsystem is responsible for keeping track of Lisp
1854 objects. It is given commands from the Lisp engine to allocate objects,
1855 garbage collect, etc.
1856 @end itemize
1857
1858 etc.
1859
1860 The important idea here is that there are a number of independent
1861 subsystems each with its own responsibility and persistent state, just
1862 like different employees in a company, and each subsystem is
1863 periodically given commands from other subsystems. Commands can flow
1864 from any one subsystem to any other, but there is usually some sort of
1865 hierarchy, with all commands originating from the event subsystem.
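
The relationship described above, in which an event loop drives subsystems that each keep their own state, can be sketched in a few lines of C. This is only an illustration under stated assumptions: the event kinds, the two state structures, and every function name below are hypothetical and do not correspond to actual XEmacs identifiers.

```c
#include <assert.h>

/* Hypothetical event kinds; the real XEmacs event types differ. */
enum event_kind { EV_KEY, EV_REDISPLAY_TICK, EV_QUIT };

/* Each subsystem owns persistent state of its own. */
struct buffer_state    { char text[64]; int len; };
struct redisplay_state { int dirty; int redraws; };

static struct buffer_state buf;
static struct redisplay_state rd;

/* Buffer subsystem: records the text change, then notifies redisplay. */
static void buffer_insert(char c)
{
  assert(buf.len >= 0);
  if (buf.len < (int) sizeof(buf.text) - 1) {
    buf.text[buf.len++] = c;
    buf.text[buf.len] = '\0';
    rd.dirty = 1;
  }
}

/* Redisplay subsystem: does work only if it was told something changed. */
static void redisplay(void)
{
  if (rd.dirty) {
    rd.redraws++;
    rd.dirty = 0;
  }
}

/* The event loop: every command to a subsystem originates here.
   Returns the number of events dispatched before EV_QUIT. */
static int run_event_loop(const enum event_kind *events, const char *keys)
{
  int handled = 0;
  for (;; events++) {
    switch (*events) {
      case EV_KEY:            buffer_insert(*keys++); break;
      case EV_REDISPLAY_TICK: redisplay();            break;
      case EV_QUIT:           return handled;
    }
    handled++;
  }
}
```

Note that the second of two consecutive redisplay ticks does nothing, which mirrors the point that redisplay is told about relevant changes and so does not have to look everywhere.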
1866
1867 XEmacs is entered in @code{main()}, which is in @file{emacs.c}. When
1868 this is called the first time (in a properly-invoked @file{temacs}), it
1869 does the following:
1870
1871 @enumerate
1872 @item
1873 It does some very basic environment initializations, such as determining
1874 where it and its directories (e.g. @file{lisp/} and @file{etc/}) reside
1875 and setting up signal handlers.
1876 @item
1877 It initializes the entire Lisp interpreter.
1878 @item
1879 It sets the initial values of many built-in variables (including many
1880 variables that are visible to Lisp programs), such as the global keymap
1881 object and the built-in faces (a face is an object that describes the
1882 display characteristics of text). This involves creating Lisp objects
1883 and thus is dependent on step (2).
1884 @item
1885 It performs various other initializations that are relevant to the
1886 particular environment it is running in, such as retrieving environment
1887 variables, determining the current date and the user who is running the
1888 program, examining its standard input, creating any necessary file
1889 descriptors, etc.
1890 @item
1891 At this point, the C initialization is complete. A Lisp program that
1892 was specified on the command line (usually @file{loadup.el}) is called
1893 (temacs is normally invoked as @code{temacs -batch -l loadup.el dump}).
1894 @file{loadup.el} loads all of the other Lisp files that are needed for
1895 the operation of the editor, calls the @code{dump-emacs} function to
1896 write out @file{xemacs}, and then kills the temacs process.
1897 @end enumerate
1898
1899 When @file{xemacs} is then run, it only redoes steps (1) and (4)
1900 above; all variables already contain the values they were set to when
1901 the executable was dumped, and all memory that was allocated with
1902 @code{malloc()} is still around. (XEmacs knows whether it is being run
1903 as @file{xemacs} or @file{temacs} because it sets the global variable
1904 @code{initialized} to 1 after step (4) above.) At this point,
1905 @file{xemacs} calls a Lisp function to do any further initialization,
1906 which includes parsing the command-line (the C code can only do limited
1907 command-line parsing, which includes looking for the @samp{-batch} and
1908 @samp{-l} flags and a few other flags that it needs to know about before
1909 initialization is complete), creating the first frame (or @dfn{window}
1910 in standard window-system parlance), running the user's init file
1911 (usually the file @file{.emacs} in the user's home directory), etc. The
1912 function to do this is usually called @code{normal-top-level};
1913 @file{loadup.el} tells the C code about this function by setting its
1914 name as the value of the Lisp variable @code{top-level}.
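
The difference between the first run (as @file{temacs}) and a later run of the dumped @file{xemacs} can be sketched as follows. This is a simplified model, not the code in @file{emacs.c}: only the flag name @code{initialized} comes from the text above, while the step functions and the counters are hypothetical. A single process stands in for both runs, with the static variables playing the role of the dumped memory image.

```c
#include <assert.h>

/* Stand-in for the global flag named in the text.  In a dumped xemacs it
   would already be 1 at startup; here one process plays both roles. */
static int initialized;

/* Hypothetical counters, one per step (1)-(4), for illustration only. */
static int runs[4];

static void basic_environment_init(void)   { runs[0]++; }   /* step (1) */
static void lisp_interpreter_init(void)    { runs[1]++; }   /* step (2) */
static void builtin_variables_init(void)                    /* step (3) */
{
  assert(runs[1] > 0);   /* creating Lisp objects depends on step (2) */
  runs[2]++;
}
static void runtime_environment_init(void) { runs[3]++; }   /* step (4) */

static void startup(void)
{
  basic_environment_init();        /* redone on every run */
  if (!initialized) {
    lisp_interpreter_init();       /* first (temacs) run only */
    builtin_variables_init();      /* first (temacs) run only */
  }
  runtime_environment_init();      /* redone on every run */
  initialized = 1;                 /* set after step (4), as described */
}
```

Calling @code{startup()} twice models the temacs run followed by the dumped run: steps (1) and (4) execute both times, steps (2) and (3) only once.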
1915
1916 When the Lisp initialization code is done, the C code enters the event
1917 loop, and stays there for the duration of the XEmacs process. The code
1918 for the event loop is contained in @file{cmdloop.c}, and is called
1919 @code{Fcommand_loop_1()}. Note that this event loop could very well be
1920 written in Lisp, and in fact a Lisp version exists; but apparently,
1921 doing this makes XEmacs run noticeably slower.
1922
1923 Notice how much of the initialization is done in Lisp, not in C.
1924 In general, XEmacs tries to move as much code as is possible
1925 into Lisp. Code that remains in C is code that implements the
1926 Lisp interpreter itself, or code that needs to be very fast, or
1927 code that needs to do system calls or other such stuff that
1928 needs to be done in C, or code that needs to have access to
1929 ``forbidden'' structures. (One conscious aspect of the design of
1930 Lisp under XEmacs is a clean separation between the external
1931 interface to a Lisp object's functionality and its internal
1932 implementation. Part of this design is that Lisp programs
1933 are forbidden from accessing the contents of the object other
1934 than through using a standard API. In this respect, XEmacs Lisp
1935 is similar to modern Lisp dialects but differs from GNU Emacs,
1936 which tends to expose the implementation and allow Lisp
1937 programs to look at it directly. The major advantage of
1938 hiding the implementation is that it allows the implementation
1939 to be redesigned without affecting any Lisp programs, including
1940 those that might want to be ``clever'' by looking directly at
1941 the object's contents and possibly manipulating them.)
1942
1943 Moving code into Lisp makes the code easier to debug and maintain and
1944 makes it much easier for people who are not XEmacs developers to
1945 customize XEmacs, because they can make a change with much less chance
1946 of obscure and unwanted interactions occurring than if they were to
1947 change the C code.
1948
1949 @node The XEmacs Object System (Abstractly Speaking), How Lisp Objects Are Represented in C, XEmacs From the Inside, Top
1950 @chapter The XEmacs Object System (Abstractly Speaking)
1951 @cindex XEmacs object system (abstractly speaking), the
1952 @cindex object system (abstractly speaking), the XEmacs
1953
1954 At the heart of the Lisp interpreter is its management of objects.
1955 XEmacs Lisp contains many built-in objects, some of which are
1956 simple and others of which can be very complex; and some of which
1957 are very common, and others of which are rarely used or are only
1958 used internally. (Since the Lisp allocation system, with its
1959 automatic reclamation of unused storage, is so much more convenient
1960 than @code{malloc()} and @code{free()}, the C code makes extensive use of it
1961 in its internal operations.)
1962
1963 The basic Lisp objects are
1964
1965 @table @code
1966 @item integer
1967 31 bits of precision, or 63 bits on 64-bit machines; the
1968 reason for this is described below when the internal Lisp object
1969 representation is described.
1970 @item char
1971 An object representing a single character of text; chars behave like
1972 integers in many ways but are logically considered text rather than
1973 numbers and have a different read syntax. (The read syntax for a char
1974 contains the char itself or some textual encoding of it---for example,
1975 a Japanese Kanji character might be encoded as @samp{^[$(B#&^[(B} using the
1976 ISO-2022 encoding standard---rather than the numerical representation
1977 of the char; this way, if the mapping between chars and integers
1978 changes, which is quite possible for Kanji characters and other extended
1979 characters, the same character will still be created. Note that some
1980 primitives confuse chars and integers. The worst culprit is @code{eq},
1981 which makes a special exception and considers a char to be @code{eq} to
1982 its integer equivalent, even though in no other case are objects of two
1983 different types @code{eq}. The reason for this monstrosity is
1984 compatibility with existing code; the separation of char from integer
1985 came fairly recently.)
1986 @item float
1987 Same precision as a double in C.
1988 @item bignum
1989 @itemx ratio
1990 @itemx bigfloat
1991 As build-time options, arbitrary-precision numbers are available.
1992 Bignums are integers, and when available they remove the restriction on
1993 buffer size. Ratios are non-integral rational numbers. Bigfloats are
1994 arbitrary-precision floating point numbers, with precision specified at
1995 runtime.
1996 @item symbol
1997 An object that contains Lisp objects and is referred to by name;
1998 symbols are used to implement variables and named functions
1999 and to provide the equivalent of preprocessor constants in C.
2000 @item string
2001 Self-explanatory; behaves much like a vector of chars
2002 but has a different read syntax and is stored and manipulated
2003 more compactly.
2004 @item bit-vector
2005 A vector of bits; similar to a string in spirit.
2006 @item vector
2007 A one-dimensional array of Lisp objects providing constant-time access
2008 to any of the objects; access to an arbitrary object in a vector is
2009 faster than for lists, but the operations that can be done on a vector
2010 are more limited.
2011 @item compiled-function
2012 An object containing compiled Lisp code, known as @dfn{byte code}.
2013 @item subr
2014 A Lisp primitive, i.e. a Lisp-callable function implemented in C.
2015 @item cons
2016 A simple container for two Lisp objects, used to implement lists and
2017 most other data structures in Lisp.
2018 @end table
2019
2020 Objects which are not conses are called atoms.
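
The 31-bit (or 63-bit) integer precision listed in the table above is what one would expect if part of each machine word is reserved as a type tag. The following C sketch assumes, purely for illustration, a single tag bit in the low-order position; the type name @code{lisp_word} and the helper functions are hypothetical, and the representation XEmacs actually uses is the one documented in the chapter on how Lisp objects are represented in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical word type: one machine word per Lisp object. */
typedef uintptr_t lisp_word;

/* Assumption: a set low bit marks the word as an immediate integer,
   leaving 31 value bits on a 32-bit machine and 63 on a 64-bit one. */
static int word_is_int(lisp_word w)
{
  return (w & 1) != 0;
}

/* Shift the value up by one bit and set the tag. */
static lisp_word make_int(intptr_t n)
{
  return ((lisp_word) n << 1) | 1;
}

/* Clear the tag and halve; the division is exact because the untagged
   word is even.  Assumes the usual two's-complement conversion. */
static intptr_t int_value(lisp_word w)
{
  return (intptr_t) (w - 1) / 2;
}
```

Under this scheme a word-aligned pointer always has a clear low bit, so it can never be mistaken for an integer.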
2021
2022 @cindex closure
2023 Note that there is no basic ``function'' type, as in more powerful
2024 versions of Lisp (where it's called a @dfn{closure}). XEmacs Lisp does
2025 not provide the closure semantics implemented by Common Lisp and Scheme.
2026 The guts of a function in XEmacs Lisp are represented in one of four
2027 ways: a symbol specifying another function (when one function is an
2028 alias for another), a list (whose first element must be the symbol
2029 @code{lambda}) containing the function's source code, a
2030 compiled-function object, or a subr object. (In other words, given a
2031 symbol specifying the name of a function, calling @code{symbol-function}
2032 to retrieve the contents of the symbol's function cell will return one
2033 of these types of objects.)
2034
2035 XEmacs Lisp also contains numerous specialized objects used to implement
2036 the editor:
2037
2038 @table @code
2039 @item buffer
2040 Stores text like a string, but is optimized for insertion and deletion
2041 and has certain other properties that can be set.
2042 @item frame
2043 An object with various properties whose displayable representation is a
2044 @dfn{window} in window-system parlance.
2045 @item window
2046 A section of a frame that displays the contents of a buffer;
2047 often called a @dfn{pane} in window-system parlance.
2048 @item window-configuration
2049 An object that represents a saved configuration of windows in a frame.
2050 @item device
2051 An object representing a screen on which frames can be displayed;
2052 equivalent to a @dfn{display} in the X Window System and a @dfn{TTY} in
2053 character mode.
2054 @item face
2055 An object specifying the appearance of text or graphics; it has
2056 properties such as font, foreground color, and background color.
2057 @item marker
2058 An object that refers to a particular position in a buffer and moves
2059 around as text is inserted and deleted to stay in the same relative
2060 position to the text around it.
2061 @item extent
2062 Similar to a marker but covers a range of text in a buffer; can also
2063 specify properties of the text, such as a face in which the text is to
2064 be displayed, whether the text is invisible or unmodifiable, etc.
2065 @item event
2066 Generated by calling @code{next-event} and contains information
2067 describing a particular event happening in the system, such as the user
2068 pressing a key or a process terminating.
2069 @item keymap
2070 An object that maps from events (described using lists, vectors, and
2071 symbols rather than with an event object because the mapping is for
2072 classes of events, rather than individual events) to functions to
2073 execute or other events to recursively look up; the functions are
2074 described by name, using a symbol, or using lists to specify the
2075 function's code.
2076 @item glyph
2077 An object that describes the appearance of an image (e.g. pixmap) on
2078 the screen; glyphs can be attached to the beginning or end of extents
2079 and in some future version of XEmacs will be able to be inserted
2080 directly into a buffer.
2081 @item process
2082 An object that describes a connection to an externally-running process.
2083 @end table
2084
2085 There are some other, less-commonly-encountered general objects:
2086
2087 @table @code
2088 @item hash-table
2089 An object that maps from an arbitrary Lisp object to another arbitrary
2090 Lisp object, using hashing for fast lookup.
2091 @item obarray
2092 A limited form of hash-table that maps from strings to symbols; obarrays
2093 are used to look up a symbol given its name and are not actually their
2094 own object type but are kludgily represented using vectors with hidden
2095 fields (this representation derives from GNU Emacs).
2096 @item specifier
2097 A complex object used to specify the value of a display property; a
2098 default value is given and different values can be specified for
2099 particular frames, buffers, windows, devices, or classes of device.
2100 @item char-table
2101 An object that maps from chars or classes of chars to arbitrary Lisp
2102 objects; internally char tables use a complex nested-vector
2103 representation that is optimized to the way characters are represented
2104 as integers.
2105 @item range-table
2106 An object that maps from ranges of integers to arbitrary Lisp objects.
2107 @end table
2108
2109 And some strange special-purpose objects:
2110
2111 @table @code
2112 @item charset
2113 @itemx coding-system
2114 Objects used when MULE, or multi-lingual/Asian-language, support is
2115 enabled.
2116 @item color-instance
2117 @itemx font-instance
2118 @itemx image-instance
2119 An object that encapsulates a window-system resource; instances are
2120 mostly used internally but are exposed on the Lisp level for cleanness
2121 of the specifier model and because it's occasionally useful for Lisp
2122 programs to create or query the properties of instances.
2123 @item subwindow
2124 An object that encapsulates a @dfn{subwindow} resource, i.e. a
2125 window-system child window that is drawn into by an external process;
2126 this object should be integrated into the glyph system but isn't yet,
2127 and may change form when this is done.
2128 @item tooltalk-message
2129 @itemx tooltalk-pattern
2130 Objects that represent resources used in the ToolTalk interprocess
2131 communication protocol.
2132 @item toolbar-button
2133 An object used in conjunction with the toolbar.
2134 @end table
2135
2136 And objects that are only used internally:
2137
2138 @table @code
2139 @item opaque
2140 A generic object for encapsulating arbitrary memory; this allows you the
2141 generality of @code{malloc()} and the convenience of the Lisp object
2142 system.
2143 @item lstream
2144 A buffering I/O stream, used to provide a unified interface to anything
2145 that can accept output or provide input, such as a file descriptor, a
2146 stdio stream, a chunk of memory, a Lisp buffer, a Lisp string, etc.;
2147 it's a Lisp object to make its memory management more convenient.
2148 @item char-table-entry
2149 Subsidiary objects in the internal char-table representation.
2150 @item extent-auxiliary
2151 @itemx menubar-data
2152 @itemx toolbar-data
2153 Various special-purpose objects that are basically just used to
2154 encapsulate memory for particular subsystems, similar to the more
2155 general ``opaque'' object.
2156 @item symbol-value-forward
2157 @itemx symbol-value-buffer-local
2158 @itemx symbol-value-varalias
2159 @itemx symbol-value-lisp-magic
2160 Special internal-only objects that are placed in the value cell of a
2161 symbol to indicate that there is something special with this variable --
2162 e.g. it has no value, it mirrors another variable, or it mirrors some C
2163 variable; there is really only one kind of object, called a
2164 @dfn{symbol-value-magic}, but it is sort-of halfway kludged into
2165 semi-different object types.
2166 @end table
2167
2168 @cindex permanent objects
2169 @cindex temporary objects
2170 Some types of objects are @dfn{permanent}, meaning that once created,
2171 they do not disappear until explicitly destroyed, using a function such
2172 as @code{delete-buffer}, @code{delete-window}, @code{delete-frame}, etc.
2173 Others will disappear once they are no longer used, through the garbage
2174 collection mechanism. Buffers, frames, windows, devices, and processes
2175 are among the objects that are permanent. Note that some objects can go
2176 both ways: Faces can be created either way; extents are normally
2177 permanent, but detached extents (extents not referring to any text, as
2178 happens to some extents when the text they are referring to is deleted)
2179 are temporary. Note that some permanent objects, such as faces and
2180 coding systems, cannot be deleted. Note also that windows are unique in
2181 that they can be @emph{undeleted} after having previously been
2182 deleted. (This happens as a result of restoring a window configuration.)
2183
2184 @cindex read syntax
2185 Many types of objects have a @dfn{read syntax}, i.e. a way of
2186 specifying an object of that type in Lisp code. When you load a Lisp
2187 file, or type in code to be evaluated, what really happens is that the
2188 function @code{read} is called, which reads some text and creates an object
2189 based on the syntax of that text; then @code{eval} is called, which
2190 possibly does something special; then this loop repeats until there's
2191 no more text to read. (@code{eval} only actually does something special
2192 with symbols, which causes the symbol's value to be returned,
2193 similar to referencing a variable; and with conses [i.e. lists],
2194 which cause a function invocation. All other values are returned
2195 unchanged.)
2196
2197 The read syntax
2198
2199 @example
2200 17297
2201 @end example
2202
2203 converts to an integer whose value is 17297.
2204
2205 @example
2206 355/113
2207 @end example
2208
2209 converts to a ratio commonly used to approximate @emph{pi} when ratios
2210 are configured, and otherwise to a symbol whose name is ``355/113'' (for
2211 backward compatibility).
2212
2213 @example
2214 1.983e-4
2215 @end example
2216
2217 converts to a float whose value is 1.983e-4, or .0001983.
2218
2219 @example
2220 ?b
2221 @end example
2222
2223 converts to a char that represents the lowercase letter b.
2224
2225 @example
2226 ?^[$(B#&^[(B
2227 @end example
2228
2229 (where @samp{^[} actually is an @samp{ESC} character) converts to a
2230 particular Kanji character when using an ISO2022-based coding system for
2231 input. (To decode this goo: @samp{ESC} begins an escape sequence;
2232 @samp{ESC $ (} is a class of escape sequences meaning ``switch to a
2233 94x94 character set''; @samp{ESC $ ( B} means ``switch to Japanese
2234 Kanji''; @samp{#} and @samp{&} collectively index into a 94-by-94 array
2235 of characters [subtract 33 from the ASCII value of each character to get
2236 the corresponding index]; @samp{ESC (} is a class of escape sequences
2237 meaning ``switch to a 94 character set''; @samp{ESC ( B} means ``switch
2238 to US ASCII''. It is a coincidence that the letter @samp{B} is used to
2239 denote both Japanese Kanji and US ASCII. If the first @samp{B} were
2240 replaced with an @samp{A}, you'd be requesting a Chinese Hanzi character
2241 from the GB2312 character set.)
2242
2243 @example
2244 "foobar"
2245 @end example
2246
2247 converts to a string.
2248
2249 @example
2250 foobar
2251 @end example
2252
2253 converts to a symbol whose name is @code{"foobar"}. This is done by
2254 looking up the string equivalent in the global variable
2255 @code{obarray}, whose contents should be an obarray. If no symbol
2256 is found, a new symbol with the name @code{"foobar"} is automatically
2257 created and added to @code{obarray}; this process is called
2258 @dfn{interning} the symbol.
2259 @cindex interning
2260
2261 @example
2262 (foo . bar)
2263 @end example
2264
2265 converts to a cons cell containing the symbols @code{foo} and @code{bar}.
2266
2267 @example
2268 (1 a 2.5)
2269 @end example
2270
2271 converts to a three-element list containing the specified objects
2272 (note that a list is actually a set of nested conses; see the
2273 XEmacs Lisp Reference).
2274
2275 @example
2276 [1 a 2.5]
2277 @end example
2278
2279 converts to a three-element vector containing the specified objects.
2280
2281 @example
2282 #[... ... ... ...]
2283 @end example
2284
2285 converts to a compiled-function object (the actual contents are not
2286 shown since they are not relevant here; look at a file that ends with
2287 @file{.elc} for examples).
2288
2289 @example
2290 #*01110110
2291 @end example
2292
2293 converts to a bit-vector.
2294
2295 @example
2296 #s(hash-table ... ...)
2297 @end example
2298
2299 converts to a hash table (the actual contents are not shown).
2300
2301 @example
2302 #s(range-table ... ...)
2303 @end example
2304
2305 converts to a range table (the actual contents are not shown).
2306
2307 @example
2308 #s(char-table ... ...)
2309 @end example
2310
2311 converts to a char table (the actual contents are not shown).
2312
2313 Note that the @code{#s()} syntax is the general syntax for structures,
2314 which are not really implemented in XEmacs Lisp but should be.
2315
2316 When an object is printed out (using @code{print} or a related
2317 function), the read syntax is used, so that the same object can be read
2318 in again.
2319
2320 The other objects do not have read syntaxes, usually because it does not
2321 really make sense to create them in this fashion (e.g. processes, where
2322 it doesn't make sense to have a subprocess created as a side effect of
2323 reading some Lisp code), or because they can't be created at all
2324 (e.g. subrs). Permanent objects, as a rule, do not have a read syntax;
2325 nor do most complex objects, which contain too much state to be easily
2326 initialized through a read syntax.
2327
2328 @node How Lisp Objects Are Represented in C, Major Textual Changes, The XEmacs Object System (Abstractly Speaking), Top
2329 @chapter How Lisp Objects Are Represented in C
2330 @cindex Lisp objects are represented in C, how
2331 @cindex objects are represented in C, how Lisp
2332 @cindex represented in C, how Lisp objects are
2333
2334 Lisp objects are represented in C using a 32-bit or 64-bit machine word
2335 (depending on the processor; e.g. DEC Alphas use 64-bit Lisp objects and
2336 most other processors use 32-bit Lisp objects). The representation
2337 stuffs a pointer together with a tag, as follows:
2338
2339 @example
2340 [ 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ]
2341 [ 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ]
2342
2343   <---------------------------------------------------------> <->
2344   a pointer to a structure, or an integer                      tag
2345 @end example
2346
2347 A tag of 00 is used for all pointer object types, a tag of 10 is used
2348 for characters, and the other two tags 01 and 11 are joined together to
2349 form the integer object type. This representation gives us 31-bit
2350 integers and 30-bit characters, while pointers are represented directly
2351 without any bit masking or shifting. This representation, though,
2352 assumes that pointers to structs are always aligned to multiples of 4,
2353 so the lower 2 bits are always zero.
2354
2355 Lisp objects use the typedef @code{Lisp_Object}, but the actual C type
2356 used for the Lisp object can vary. It can be either a simple type
2357 (@code{long} on the DEC Alpha, @code{int} on other machines) or a
2358 structure whose fields are bit fields that line up properly (actually, a
2359 union of structures is used). Generally the simple integral type is
2360 preferable because it ensures that the compiler will actually use a
2361 machine word to represent the object (some compilers will use more
2362 general and less efficient code for unions and structs even if they can
2363 fit in a machine word). The union type, however, has the advantage of
2364 stricter type checking. If you accidentally pass an integer where a Lisp
2365 object is desired, you get a compile error. The choice of which type
2366 to use is determined by the preprocessor constant @code{USE_UNION_TYPE}
2367 which is defined via the @code{--use-union-type} option to
2368 @code{configure}.
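
The type-checking benefit of a non-integral representation can be seen
in a small sketch. This is an illustration only: it uses a hypothetical
one-member struct, @code{checked_object}, rather than the union of
bit-field structures that XEmacs actually uses, and the helper names
@code{wrap_bits} and @code{unwrap_bits} are likewise made up for the
example.

@example
/* Hypothetical wrapper type, standing in for the union representation. */
typedef struct @{ long bits; @} checked_object;

/* Build a wrapped object from a raw machine word. */
static checked_object
wrap_bits (long bits)
@{
  checked_object obj;
  obj.bits = bits;
  return obj;
@}

/* Extract the raw machine word again. */
static long
unwrap_bits (checked_object obj)
@{
  return obj.bits;
@}

/* unwrap_bits (42);   <- rejected by the compiler, because 42 is not a
   checked_object.  With a plain `typedef long checked_object;' the same
   call would compile silently. */
@end example

The price of the wrapper is that a value can no longer be built inside
an arbitrary expression without a helper such as @code{wrap_bits}.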
2369
2370 Various macros are used to convert between Lisp_Objects and the
2371 corresponding C type. Macros of the form @code{XINT()}, @code{XCHAR()},
2372 @code{XSTRING()}, @code{XSYMBOL()}, do any required bit shifting and/or
2373 masking and cast it to the appropriate type. @code{XINT()} needs to be
2374 a bit tricky so that negative numbers are properly sign-extended. Since
2375 integers are stored left-shifted, if the right-shift operator does an
2376 arithmetic shift (i.e. it leaves the most-significant bit as-is rather
2377 than shifting in a zero, so that it mimics a divide-by-two even for
2378 negative numbers) the shift to remove the tag bit is enough. This is
2379 the case on all the systems we support.
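
As a rough illustration of the tagging scheme and the sign-extending
shift described above, here is a minimal, self-contained sketch. It is
not the XEmacs implementation: the names @code{toy_object},
@code{toy_classify}, @code{toy_make_int}, and @code{toy_int_value} are
hypothetical, and the real macros live in the XEmacs headers. Like
XEmacs itself, the sketch assumes that right-shifting a negative signed
value is an arithmetic shift.

@example
#include <stdint.h>

/* Hypothetical stand-in for Lisp_Object: one machine word. */
typedef intptr_t toy_object;

enum toy_kind @{ TOY_POINTER, TOY_INT, TOY_CHAR @};

/* Classify a word by its low two bits:
   00 = pointer, 10 = character, 01 and 11 = integer. */
static enum toy_kind
toy_classify (toy_object obj)
@{
  if (obj & 1)
    return TOY_INT;             /* tags 01 and 11 */
  return (obj & 2) ? TOY_CHAR : TOY_POINTER;
@}

/* Store an integer shifted left by one bit, with the low bit set. */
static toy_object
toy_make_int (intptr_t value)
@{
  return (toy_object) (((uintptr_t) value << 1) | 1);
@}

/* Recover the integer; assumes >> on a negative value is arithmetic,
   so the sign bit is preserved while the tag bit is dropped. */
static intptr_t
toy_int_value (toy_object obj)
@{
  return obj >> 1;
@}
@end example

Because only the lowest bit marks an integer, one more bit is left for
the value than in the two-bit character tag, which is where the 31-bit
versus 30-bit difference comes from.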
2380
2381 Note that when @code{ERROR_CHECK_TYPECHECK} is defined, the converter
2382 macros become more complicated---they check the tag bits and/or the
2383 type field in the first four bytes of a record type to ensure that the
2384 object is really of the correct type. This is great for catching places
2385 where an incorrect type is being dereferenced---this typically results
2386 in a pointer being dereferenced as the wrong type of structure, with
2387 unpredictable (and sometimes not easily traceable) results.
2388
2389 There are similar @code{XSET@var{TYPE}()} macros that construct a Lisp
2390 object. These macros are of the form @code{XSET@var{TYPE}
2391 (@var{lvalue}, @var{result})}, i.e. they have to be a statement rather
2392 than just used in an expression. The reason for this is that standard C
2393 doesn't let you ``construct'' a structure (but GCC does). Granted, this
2394 sometimes isn't too convenient; for the case of integers, at least, you
2395 can use the function @code{make_int()}, which constructs and
2396 @emph{returns} an integer Lisp object. Note that the
2397 @code{XSET@var{TYPE}()} macros are also affected by
2398 @code{ERROR_CHECK_TYPECHECK} and make sure that the structure is of the
2399 right type in the case of record types, where the type is contained in
2400 the structure.
2401
2402 The C programmer is responsible for @strong{guaranteeing} that a
2403 Lisp_Object is the correct type before using the @code{X@var{TYPE}}
2404 macros. This is especially important in the case of lists. Use
2405 @code{XCAR} and @code{XCDR} if a Lisp_Object is certainly a cons cell,
2406 else use @code{Fcar()} and @code{Fcdr()}. Trust other C code, but not
2407 Lisp code. On the other hand, if XEmacs has an internal logic error,
2408 it's better to crash immediately, so sprinkle @code{assert()}s and
2409 ``unreachable'' @code{abort()}s liberally about the source code. Where
2410 performance is an issue, use @code{type_checking_assert},
2411 @code{bufpos_checking_assert}, and @code{gc_checking_assert}, which do
2412 nothing unless the corresponding configure error checking flag was
2413 specified.
2414
2415 @node Major Textual Changes, Rules When Writing New C Code, How Lisp Objects Are Represented in C, Top
2416 @chapter Major Textual Changes
2417 @cindex textual changes, major
2418 @cindex major textual changes
2419
2420 Sometimes major textual changes are made to the source. This means that
2421 a search-and-replace is done to change type names and such. Some people
2422 disagree with such changes, and certainly, if done without good reason,
2423 they will just lead to headaches. But it's important to keep the code clean
2424 and understandable, and consistent naming goes a long way towards this.
2425
2426 An example of the right way to do this was the so-called "great integral
2427 type renaming".
2428
2429 @menu
2430 * Great Integral Type Renaming::
2431 * Text/Char Type Renaming::
2432 @end menu
2433
2434 @node Great Integral Type Renaming, Text/Char Type Renaming, Major Textual Changes, Major Textual Changes
2435 @section Great Integral Type Renaming
2436 @cindex Great Integral Type Renaming
2437 @cindex integral type renaming, great
2438 @cindex type renaming, integral
2439 @cindex renaming, integral types
2440
2441 The purpose of this is to rationalize the names used for various
2442 integral types, so that they match their intended uses and follow
2443 consistent conventions, and eliminate types that were not semantically
2444 different from each other.
2445
2446 The conventions are:
2447
2448 @itemize @bullet
2449 @item
2450 All integral types that measure quantities of anything are signed. Some
2451 people disagree vociferously with this, but their arguments are mostly
2452 theoretical, and are vastly outweighed by the practical headaches of
2453 mixing signed and unsigned values, and more importantly by the far
2454 increased likelihood of inadvertent bugs: Because of the broken "viral"
2455 nature of unsigned quantities in C (operations involving mixed
2456 signed/unsigned are done unsigned, when exactly the opposite is nearly
2457 always wanted), even a single error in declaring a quantity unsigned
2458 that should be signed, or the even more subtle error of comparing
2459 signed and unsigned values and forgetting the necessary cast, can be
2460 catastrophic, as comparisons will yield wrong results. -Wsign-compare
2461 is turned on specifically to catch this, but this tends to result in a
2462 great number of warnings when mixing signed and unsigned, and the casts
2463 are annoying. More has been written on this elsewhere.
2464
2465 @item
2466 All such quantity types just mentioned boil down to EMACS_INT, which is
2467 32 bits on 32-bit machines and 64 bits on 64-bit machines. This is
2468 guaranteed to be the same size as Lisp objects of type @code{int}, and (as
2469 far as I can tell) of size_t (unsigned!) and ssize_t. The only type
2470 below that is not an EMACS_INT is Hashcode, which is an unsigned value
2471 of the same size as EMACS_INT.
2472
2473 @item
2474 Type names should be relatively short (no more than 10 characters or
2475 so), with the first letter capitalized and no underscores if they can at
2476 all be avoided.
2477
2478 @item
2479 "count" == a zero-based measurement of some quantity. Includes sizes,
2480 offsets, and indexes.
2481
2482 @item
2483 "bpos" == a one-based measurement of a position in a buffer. "Charbpos"
2484 and "Bytebpos" count text in the buffer, rather than bytes in memory;
2485 thus Bytebpos does not directly correspond to the memory representation.
2486 Use "Membpos" for this.
2487
2488 @item
2489 "Char" refers to internal-format characters, not to the C type "char",
2490 which is really a byte.
2491 @end itemize
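
The hazard behind the first convention can be reproduced in a few
lines. The sketch below is not taken from the XEmacs sources, and the
function names are hypothetical. In the mixed comparison the signed
operand is converted to unsigned before the comparison is made, so a
negative count compares as a very large positive value.

@example
/* Mixed comparison: `count' is converted to unsigned int first, so -1
   becomes UINT_MAX and the test yields 0 (false). */
static int
fits_mixed (int count, unsigned int limit)
@{
  return count < limit;
@}

/* All-signed comparison behaves as intended. */
static int
fits_signed (int count, int limit)
@{
  return count < limit;
@}
@end example

With these definitions, @code{fits_mixed (-1, 10)} is 0 while
@code{fits_signed (-1, 10)} is 1. Compiling the first function with
@code{-Wsign-compare} produces the kind of warning mentioned above.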
2492
2493 For the actual name changes, see the script below.
2494
2495 I ran the following script to do the conversion. (NOTE: This script is
2496 idempotent. You can safely run it multiple times and it will not screw
2497 up previous results -- in fact, it will do nothing if nothing has
2498 changed. Thus, it can be run repeatedly as necessary to handle patches
2499 coming in from old workspaces, or old branches.) There are two tags,
2500 just before and just after the change: @samp{pre-integral-type-rename}
2501 and @samp{post-integral-type-rename}. When merging code from the main
2502 trunk into a branch, the best thing to do is first merge up to
2503 @samp{pre-integral-type-rename}, then apply the script and associated
2504 changes, then merge from @samp{post-integral-type-rename} to the
2505 present. (Alternatively, just do the merging in one operation; but you
2506 may then have a lot of conflicts needing to be resolved by hand.)
2507
2508 Script @samp{fixtypes.sh} follows:
2509
2510 @example
2511 ----------------------------------- cut ------------------------------------
2512 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
2513 gr Memory_Count Bytecount $files
2514 gr Lstream_Data_Count Bytecount $files
2515 gr Element_Count Elemcount $files
2516 gr Hash_Code Hashcode $files
2517 gr extcount bytecount $files
2518 gr bufpos charbpos $files
2519 gr bytind bytebpos $files
2520 gr memind membpos $files
2521 gr bufbyte intbyte $files
2522 gr Extcount Bytecount $files
2523 gr Bufpos Charbpos $files
2524 gr Bytind Bytebpos $files
2525 gr Memind Membpos $files
2526 gr Bufbyte Intbyte $files
2527 gr EXTCOUNT BYTECOUNT $files
2528 gr BUFPOS CHARBPOS $files
2529 gr BYTIND BYTEBPOS $files
2530 gr MEMIND MEMBPOS $files
2531 gr BUFBYTE INTBYTE $files
2532 gr MEMORY_COUNT BYTECOUNT $files
2533 gr LSTREAM_DATA_COUNT BYTECOUNT $files
2534 gr ELEMENT_COUNT ELEMCOUNT $files
2535 gr HASH_CODE HASHCODE $files
2536 ----------------------------------- cut ------------------------------------
2537 @end example
2538
2539 The @samp{gr} script, and the scripts it uses, are documented in
2540 @file{README.global-renaming}, because if placed in this file they would
2541 need to have their @@ characters doubled, meaning you couldn't easily
2542 cut and paste from the source.
2543
2544 In addition to those programs, I needed to fix up a few other
2545 things, particularly relating to the duplicate definitions of
2546 types, now that some types merged with others. Specifically:
2547
2548 @enumerate
2549 @item
2550 in @file{lisp.h}, removed duplicate declarations of Bytecount. The changed
2551 code should now look like this: (In each code snippet below, the first
2552 and last lines are the same as the original, as are all lines outside of
2553 those lines. That allows you to locate the section to be replaced, and
2554 replace the stuff in that section, verifying that there isn't anything
2555 new added that would need to be kept.)
2556
2557 @example
2558 --------------------------------- snip -------------------------------------
2559 /* Counts of bytes or chars */
2560 typedef EMACS_INT Bytecount;
2561 typedef EMACS_INT Charcount;
2562
2563 /* Counts of elements */
2564 typedef EMACS_INT Elemcount;
2565
2566 /* Hash codes */
2567 typedef unsigned long Hashcode;
2568
2569 /* ------------------------ dynamic arrays ------------------- */
2570 --------------------------------- snip -------------------------------------
2571 @end example
2572
2573 @item
2574 in @file{lstream.h}, removed duplicate declaration of Bytecount. Rewrote the
2575 comment about this type. The changed code should now look like this:
2576
2577 @example
2578 --------------------------------- snip -------------------------------------
2579 #endif
2580
2581 /* There have been some arguments over what the type should be that
2582 specifies a count of bytes in a data block to be written out or read in,
2583 using @code{Lstream_read()}, @code{Lstream_write()}, and related functions.
2584 Originally it was long, which worked fine; Martin "corrected" these to
2585 size_t and ssize_t on the grounds that this is theoretically cleaner and
2586 is in keeping with the C standards. Unfortunately, this practice is
2587 horribly error-prone due to design flaws in the way that mixed
2588 signed/unsigned arithmetic happens. In fact, by doing this change,
2589 Martin introduced a subtle but fatal error that caused the operation of
2590 sending large mail messages to the SMTP server under Windows to fail.
2591 By putting all values back to be signed, avoiding any signed/unsigned
2592 mixing, the bug immediately went away. The type then in use was
2593 Lstream_Data_Count, so that it could be reverted cleanly if a vote came to
2594 that. Now it is Bytecount.
2595
2596 Some earlier comments about why the type must be signed: This MUST BE
2597 SIGNED, since it also is used in functions that return the number of
2598 bytes actually read to or written from in an operation, and these
2599 functions can return -1 to signal error.
2600
2601 Note that the standard Unix @code{read()} and @code{write()} functions define the
2602 count going in as a size_t, which is UNSIGNED, and the count going
2603 out as an ssize_t, which is SIGNED. This is a horrible design
2604 flaw. Not only is it highly likely to lead to logic errors when a
2605 -1 gets interpreted as a large positive number, but operations are
2606 bound to fail in all sorts of horrible ways when a number in the
2607 upper-half of the size_t range is passed in -- this number is
2608 unrepresentable as an ssize_t, so code that checks to see how many
2609 bytes are actually written (which is mandatory if you are dealing
2610 with certain types of devices) will get completely screwed up.
2611
2612 --ben
2613 */
2614
2615 typedef enum lstream_buffering
2616 --------------------------------- snip -------------------------------------
2617 @end example
2618
2619 @item
2620 in @file{dumper.c}, there are four places, all inside of @code{switch()} statements,
2621 where XD_BYTECOUNT appears twice as a case tag. In each case, the two
2622 case blocks contain identical code, and you should *REMOVE THE SECOND*
2623 and leave the first.
2624 @end enumerate
2625
2626 @node Text/Char Type Renaming, , Great Integral Type Renaming, Major Textual Changes
2627 @section Text/Char Type Renaming
2628 @cindex Text/Char Type Renaming
2629 @cindex type renaming, text/char
2630 @cindex renaming, text/char types
2631
2632 The purpose of this was
2633
2634 @enumerate
2635 @item
2636 To distinguish between ``charptr'' when it refers to operations on
2637 the pointer itself and when it refers to operations on text
2638 @item
2639 To use consistent naming for everything referring to internal format, i.e.
2640 @end enumerate
2641
2642 @example
2643 Itext == text in internal format
2644 Ibyte == a byte in such text
2645 Ichar == a char as represented in internal character format
2646 @end example
2647
2648 Thus e.g.
2649
2650 @example
2651 set_charptr_emchar -> set_itext_ichar
2652 @end example
2653
2654 This was done using a script like this:
2655
2656 @example
2657 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
2658 gr Intbyte Ibyte $files
2659 gr INTBYTE IBYTE $files
2660 gr intbyte ibyte $files
2661 gr EMCHAR ICHAR $files
2662 gr emchar ichar $files
2663 gr Emchar Ichar $files
2664 gr INC_CHARPTR INC_IBYTEPTR $files
2665 gr DEC_CHARPTR DEC_IBYTEPTR $files
2666 gr VALIDATE_CHARPTR VALIDATE_IBYTEPTR $files
2667 gr valid_charptr valid_ibyteptr $files
2668 gr CHARPTR ITEXT $files
2669 gr charptr itext $files
2670 gr Charptr Itext $files
2671 @end example
2672
2673 See above for where to find the source to @samp{gr}.
2674
2675 As in the integral-types change, there are pre and post tags before and
2676 after the change:
2677
2678 @example
2679 pre-internal-format-textual-renaming
2680 post-internal-format-textual-renaming
2681 @end example
2682
2683 When merging a large branch, follow the same sort of procedure
2684 documented above, using these tags -- essentially sync up to the pre
2685 tag, then apply the script yourself, then sync from the post tag to the
2686 present. You can probably do the same if you don't have a separate
2687 workspace, but do have lots of outstanding changes and you'd rather not
2688 just merge all the textual changes directly. Use something like this:
2689
2690 (WARNING: I'm not a CVS guru; before trying this, or any large operation
2691 that might potentially mess things up, @strong{DEFINITELY} make a backup of
2692 your existing workspace.)
2693
2694 @example
2695 cup -r pre-internal-format-textual-renaming
2696 <apply script>
2697 cup -A -j post-internal-format-textual-renaming -j HEAD
2698 @end example
2699
2700 This might also work:
2701
2702 @example
2703 cup -j pre-internal-format-textual-renaming
2704 <apply script>
2705 cup -j post-internal-format-textual-renaming -j HEAD
2706 @end example
2707
2708 ben
2709
2710 The following is a script to go in the opposite direction:
2711
2712 @example
2713 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
2714
2715 # Evidently Perl considers _ to be a word char ala \b, even though XEmacs
2716 # doesn't. We need to be careful here with ibyte/ichar because of words
2717 # like Richard, @code{eicharlen()}, multibyte, HIBYTE, etc.
2718
2719 gr Ibyte Intbyte $files
2720 gr '\bIBYTE' INTBYTE $files
2721 gr '\bibyte' intbyte $files
2722 gr '\bICHAR' EMCHAR $files
2723 gr '\bichar' emchar $files
2724 gr '\bIchar' Emchar $files
2725 gr '\bIBYTEPTR' CHARPTR $files
2726 gr '\bibyteptr' charptr $files
2727 gr '\bITEXT' CHARPTR $files
2728 gr '\bitext' charptr $files
2729 gr '\bItext' Charptr $files
2730
2731 gr '_IBYTE' _INTBYTE $files
2732 gr '_ibyte' _intbyte $files
2733 gr '_ICHAR' _EMCHAR $files
2734 gr '_ichar' _emchar $files
2735 gr '_Ichar' _Emchar $files
2736 gr '_IBYTEPTR' _CHARPTR $files
2737 gr '_ibyteptr' _charptr $files
2738 gr '_ITEXT' _CHARPTR $files
2739 gr '_itext' _charptr $files
2740 gr '_Itext' _Charptr $files
2741 @end example
2742
2743 @node Rules When Writing New C Code, Regression Testing XEmacs, Major Textual Changes, Top
2744 @chapter Rules When Writing New C Code
2745 @cindex writing new C code, rules when
2746 @cindex C code, rules when writing new
2747 @cindex code, rules when writing new C
2748
2749 The XEmacs C code is extremely complex and intricate, and there are many
2750 rules that are more or less consistently followed throughout the code.
2751 Many of these rules are not obvious, so they are explained here. It is
2752 of the utmost importance that you follow them. If you don't, you may
2753 get something that appears to work, but which will crash in odd
2754 situations, often in code far away from where the actual breakage is.
2755
2756 @menu
2757 * A Reader's Guide to XEmacs Coding Conventions::
2758 * General Coding Rules::
2759 * Object-Oriented Techniques for C::
2760 * Writing Lisp Primitives::
2761 * Writing Good Comments::
2762 * Adding Global Lisp Variables::
2763 * Writing Macros::
2764 * Proper Use of Unsigned Types::
2765 * Techniques for XEmacs Developers::
2766 @end menu
2767
2768 See also @ref{Coding for Mule}.
2769
2770 @node A Reader's Guide to XEmacs Coding Conventions, General Coding Rules, Rules When Writing New C Code, Rules When Writing New C Code
2771 @section A Reader's Guide to XEmacs Coding Conventions
2772 @cindex coding conventions
2773 @cindex reader's guide
2774 @cindex coding rules, naming
2775
2776 Of course the low-level implementation language of XEmacs is C, but much
2777 of that uses the Lisp engine to do its work. However, because the code
2778 is ``inside'' of the protective containment shell around the ``reactor
2779 core,'' you'll see lots of complex ``plumbing'' needed to do the work
2780 and ``safety mechanisms,'' whose failure results in a meltdown. This
2781 section provides a quick overview (or review) of the various components
2782 of the implementation of Lisp objects.
2783
2784 Two typographic conventions help to identify C objects that implement
2785 Lisp objects. The first is that capitalized identifiers, especially
2786 beginning with the letters @samp{Q}, @samp{V}, @samp{F}, and @samp{S},
2787 for C variables and functions, and C macros beginning with the
2788 letter @samp{X}, are used to implement Lisp. The second is that where
2789 Lisp uses the hyphen @samp{-} in symbol names, the corresponding C
2790 identifiers use the underscore @samp{_}. Of course, since XEmacs Lisp
2791 contains interfaces to many external libraries, those external names
2792 will follow the coding conventions their authors chose, and may overlap
2793 the ``XEmacs name space.'' However these cases are usually pretty
2794 obvious.
2795
2796 All Lisp objects are handled indirectly. The @code{Lisp_Object}
2797 type is usually a pointer to a structure, except for a very small number
2798 of types with immediate representations (currently characters and
2799 integers). However, these types cannot be directly operated on in C
2800 code, either, so they can also be considered indirect. Types that do
2801 not have an immediate representation always have a C typedef
2802 @code{Lisp_@var{type}} for a corresponding structure.
2803 @c #### mention l(c)records here?
2804
2805 In older code, it was common practice to pass around pointers to
2806 @code{Lisp_@var{type}}, but this is now deprecated in favor of using
2807 @code{Lisp_Object} for all function arguments and return values that are
2808 Lisp objects. The @code{X@var{type}} macro is used to extract the
2809 pointer and cast it to @code{(Lisp_@var{type} *)} for the desired type.
2810
2811 @strong{Convention}: macros whose names begin with @samp{X} operate on
2812 @code{Lisp_Object}s and do no type-checking. Many such macros are type
2813 extractors, but others implement Lisp operations in C (@emph{e.g.},
2814 @code{XCAR} implements the Lisp @code{car} function). These are unsafe,
2815 and must only be used where types of all data have already been checked.
2816 Such macros are only applied to @code{Lisp_Object}s. In internal
2817 implementations where the pointer has already been converted, the
2818 structure is operated on directly using the C @code{->} member access
2819 operator.
2820
2821 The @code{@var{type}P}, @code{CHECK_@var{type}}, and
2822 @code{CONCHECK_@var{type}} macros are used to test types. The first
2823 returns a Boolean value, and the latter two signal errors. (The
2824 @samp{CONCHECK} variety allows execution to be CONtinued under some
2825 circumstances, thus the name.) Functions which expect to be passed user
2826 data invariably call @samp{CHECK} macros on arguments.
2827
2828 There are many types of specialized Lisp objects implemented in C, but
2829 the most pervasive type is the @dfn{symbol}. Symbols are used as
2830 identifiers, variables, and functions.
2831
2832 @strong{Convention}: Global variables whose names begin with @samp{Q}
2833 are constants whose value is a symbol. The name of the variable should
2834 be derived from the name of the symbol using the same rules as for Lisp
2835 primitives. Such variables allow the C code to check whether a
2836 particular @code{Lisp_Object} is equal to a given symbol. Symbols are
2837 Lisp objects, so these variables may be passed to Lisp primitives. (An
2838 alternative to the use of @samp{Q...} variables is to call the
2839 @code{intern} function at initialization in the
2840 @code{vars_of_@var{module}} function, which is hardly less efficient.)
2841
2842 @strong{Convention}: Global variables whose names begin with @samp{V}
2843 are variables that contain Lisp objects. The convention here is that
2844 all global variables of type @code{Lisp_Object} begin with @samp{V}, and
2845 no others do (not even integer and boolean variables that have Lisp
2846 equivalents). Most of the time, these variables have equivalents in
2847 Lisp, which are defined via the @samp{DEFVAR} family of macros, but some
2848 don't. Since the variable's value is a @code{Lisp_Object}, it can be
2849 passed to Lisp primitives.
2850
2851 The implementation of Lisp primitives is more complex.
2852 @strong{Convention}: Global variables with names beginning with @samp{S}
2853 contain a structure that allows the Lisp engine to identify and call a C
2854 function. In modern versions of XEmacs, these identifiers are almost
2855 always completely hidden in the @code{DEFUN} and @code{SUBR} macros, but
2856 you will encounter them if you look at very old versions of XEmacs or at
2857 GNU Emacs. @strong{Convention}: Functions with names beginning with
2858 @samp{F} implement Lisp primitives. Of course all their arguments and
2859 their return values must be of type @code{Lisp_Object}. (This is hidden in the
2860 @code{DEFUN} macro.)
2861
2862
2863 @node General Coding Rules, Object-Oriented Techniques for C, A Reader's Guide to XEmacs Coding Conventions, Rules When Writing New C Code
2864 @section General Coding Rules
2865 @cindex coding rules, general
2866
2867 The C code is actually written in a dialect of C called @dfn{Clean C},
2868 meaning that it can be compiled, mostly warning-free, with either a C or
2869 C++ compiler. Coding in Clean C has several advantages over plain C.
2870 C++ compilers are more nit-picking, and a number of coding errors have
2871 been found by compiling with C++. The ability to use both C and C++
2872 tools means that a greater variety of development tools is available to
2873 the developer. In addition, the ability to overload operators in C++
2874 means it is possible, for error-checking purposes, to redefine certain
2875 simple types (normally defined as aliases for simple built-in types such
2876 as @code{unsigned char} or @code{long}) as classes, strictly limiting the permissible
2877 operations and catching illegal implicit casts and such.
2878
2879 Every module includes @file{<config.h>} (angle brackets so that
2880 @samp{--srcdir} works correctly; @file{config.h} may or may not be in
2881 the same directory as the C sources) and @file{lisp.h}. @file{config.h}
2882 must always be included before any other header files (including
2883 system header files) to ensure that certain tricks played by various
2884 @file{s/} and @file{m/} files work out correctly.
2885
2886 When including header files, always use angle brackets, not double
2887 quotes, except when the file to be included is always in the same
2888 directory as the including file. If either file is a generated file,
2889 then that is not likely to be the case. In order to understand why we
2890 have this rule, imagine what happens when you do a build in the source
2891 directory using @samp{./configure} and another build in another
2892 directory using @samp{../work/configure}. There will be two different
2893 @file{config.h} files. Which one will be used if you @samp{#include
2894 "config.h"}?
2895
2896 Almost every module contains a @code{syms_of_*()} function and a
2897 @code{vars_of_*()} function. The former declares any Lisp primitives
2898 you have defined and defines any symbols you will be using. The latter
2899 declares any global Lisp variables you have added and initializes global
2900 C variables in the module. @strong{Important}: There are stringent
2901 requirements on exactly what can go into these functions. See the
2902 comment in @file{emacs.c}. The reason for this is to avoid obscure
2903 unwanted interactions during initialization. If you don't follow these
2904 rules, you'll be sorry! If you want to do anything that isn't allowed,
2905 create a @code{complex_vars_of_*()} function for it. Doing this is
2906 tricky, though: you have to make sure your function is called at the
2907 right time so that all the initialization dependencies work out.
2908
2909 Declare each function of these kinds in @file{symsinit.h}. Make sure
2910 it's called in the appropriate place in @file{emacs.c}. You never need
2911 to include @file{symsinit.h} directly, because it is included by
2912 @file{lisp.h}.
2913
2914 @strong{All global and static variables that are to be modifiable must
2915 be declared uninitialized.} This means that you may not use the
2916 ``declare with initializer'' form for these variables, such as @code{int
2917 some_variable = 0;}. The reason for this has to do with some kludges
2918 done during the dumping process: If possible, the initialized data
2919 segment is re-mapped so that it becomes part of the (unmodifiable) code
2920 segment in the dumped executable. This allows this memory to be shared
2921 among multiple running XEmacs processes. XEmacs is careful to place as
2922 much constant data as possible into initialized variables during the
2923 @file{temacs} phase.
2924
2925 @cindex copy-on-write
2926 @strong{Please note:} This kludge only works on a few systems nowadays,
2927 and is rapidly becoming irrelevant because most modern operating systems
2928 provide @dfn{copy-on-write} semantics. All data is initially shared
2929 between processes, and a private copy is automatically made (on a
2930 page-by-page basis) when a process first attempts to write to a page of
2931 memory.
2932
2933 Formerly, there was a requirement that static variables not be declared
2934 inside of functions. This had to do with another hack along the same
2935 vein as what was just described: old USG systems put statically-declared
2936 variables in the initialized data space, so those header files had a
2937 @code{#define static} declaration. (That way, the data-segment remapping
2938 described above could still work.) This fails badly on static variables
2939 inside of functions, which suddenly become automatic variables;
2940 therefore, you weren't supposed to have any of them. This awful kludge
2941 has been removed in XEmacs because
2942
2943 @enumerate
2944 @item
2945 almost all of the systems that used this kludge ended up having
2946 to disable the data-segment remapping anyway;
2947 @item
2948 the only systems that didn't were extremely outdated ones;
2949 @item
2950 this hack completely messed up inline functions.
2951 @end enumerate
2952
2953 The C source code makes heavy use of C preprocessor macros. One popular
2954 macro style is:
2955
2956 @example
2957 #define FOO(var, value) do @{ \
2958 Lisp_Object FOO_value = (value); \
2959 ... /* compute using FOO_value */ \
2960 (var) = bar; \
2961 @} while (0)
2962 @end example
2963
2964 The @code{do @{...@} while (0)} is a standard trick to allow @code{FOO} to have
2965 statement semantics, so that it can safely be used within an @code{if}
2966 statement in C, for example. Multiple evaluation is prevented by
2967 copying a supplied argument into a local variable, so that
2968 @code{FOO(var,fun(1))} only calls @code{fun} once.
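
As a self-contained illustration of this style (a sketch, not code from the XEmacs sources), the following uses a hypothetical macro @code{SET_DOUBLED} and a hypothetical helper @code{fun} that counts its own calls. It shows both properties described above: the macro behaves as a single statement inside an unbraced @code{if}/@code{else}, and its @code{value} argument is evaluated only once.

```c
#include <assert.h>

/* Hypothetical helper that counts how often it is called. */
static int fun_calls;

static int
fun (int x)
{
  fun_calls++;
  return x + 1;
}

/* Hypothetical macro in the style shown above.  VALUE is copied into a
   local variable, so it is evaluated once although the body uses it
   twice. */
#define SET_DOUBLED(var, value) do {                    \
  int SET_DOUBLED_value = (value);                      \
  (var) = SET_DOUBLED_value + SET_DOUBLED_value;        \
} while (0)

static int
doubled_if_positive (int x)
{
  int result;
  /* The macro expands to one statement, so the unbraced if/else
     below parses as intended. */
  if (x > 0)
    SET_DOUBLED (result, fun (x));
  else
    result = 0;
  return result;
}

static void
example_set_doubled (void)
{
  fun_calls = 0;
  assert (doubled_if_positive (1) == 4); /* fun (1) returns 2 */
  assert (fun_calls == 1);               /* fun was called only once */
}
```

Had the macro body been written with plain braces instead of @code{do @{...@} while (0)}, the semicolon after the macro call would have terminated the @code{if} statement and left the @code{else} dangling.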
2969
2970 Lisp lists are popular data structures in the C code as well as in
2971 Elisp. There are two sets of macros that iterate over lists.
2972 @code{EXTERNAL_LIST_LOOP_@var{n}} should be used when the list has been
2973 supplied by the user, and cannot be trusted to be acyclic and
2974 @code{nil}-terminated. A @code{malformed-list} or @code{circular-list} error
2975 will be generated if the list being iterated over is not entirely
2976 kosher. @code{LIST_LOOP_@var{n}}, on the other hand, is faster and less
2977 safe, and can be used only on trusted lists.
2978
2979 Related macros are @code{GET_EXTERNAL_LIST_LENGTH} and
2980 @code{GET_LIST_LENGTH}, which calculate the length of a list; in the
2981 case of @code{GET_EXTERNAL_LIST_LENGTH}, the macro also validates that
2982 the list is proper. The macros @code{EXTERNAL_LIST_LOOP_DELETE_IF} and
2983 @code{LIST_LOOP_DELETE_IF} delete the elements of a Lisp list that satisfy some
2984 predicate.
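
The difference between the trusted and the checked traversal can be sketched outside XEmacs as follows. This is only a model under stated assumptions: @code{struct cell}, @code{trusted_length} and @code{checked_length} are hypothetical names, a C @code{NULL} pointer stands in for @code{nil}, and the tortoise-and-hare technique shown is one common way to detect circularity, not necessarily what the real macros expand to. (A list with a non-@code{nil} final cdr cannot be expressed with these typed pointers, so only the circular case is modelled.)

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a Lisp cons cell; real XEmacs lists are made of
   Lisp_Objects, not bare C pointers. */
struct cell
{
  int car;
  struct cell *cdr;             /* NULL plays the role of nil */
};

/* Fast variant for trusted lists: no checks at all.  It would loop
   forever on a circular list. */
static int
trusted_length (const struct cell *list)
{
  int len = 0;
  for (; list; list = list->cdr)
    len++;
  return len;
}

/* Careful variant for untrusted lists.  Returns the length, or -1 if
   the list is circular.  The "tortoise" advances one cell for every
   two cells the "hare" advances; the hare can only catch up with the
   tortoise if the list loops back on itself. */
static int
checked_length (const struct cell *list)
{
  const struct cell *hare = list, *tortoise = list;
  int len = 0;

  while (hare)
    {
      hare = hare->cdr;
      len++;
      if (len % 2 == 0)
        tortoise = tortoise->cdr;
      if (hare && hare == tortoise)
        return -1;
    }
  return len;
}
```

Where this sketch returns @minus{}1, the real macros signal a Lisp error instead.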
2985
2986 @node Object-Oriented Techniques for C, Writing Lisp Primitives, General Coding Rules, Rules When Writing New C Code
2987 @section Object-Oriented Techniques for C
2988 @cindex coding rules, object-oriented
2989 @cindex object-oriented techniques
2990
2991 At the lowest levels, XEmacs makes heavy use of object-oriented
2992 techniques to promote code-sharing and uniform interfaces for different
2993 devices and platforms. Commonly, but not always, such objects are
2994 ``wrapped'' and exported to Lisp as Lisp objects. Usually they use
2995 the internal structures developed for Lisp objects (the @samp{lrecord}
2996 structure) in order to take advantage of Lisp memory management.
2997 Unfortunately, XEmacs was originally written in C, so these techniques
2998 are based on heavy use of C macros.
2999
3000 @c You can't use @var{} for type below, because case is important.
3001 A module defining a class is likely to use most of the following
3002 declarations and macros. In the following, the notation @samp{<type>}
3003 will stand for the full name of the class, and will be capitalized in
3004 the way normal for its context. The notation @samp{<typ>} will stand
3005 for the abbreviated form commonly used in macro names, while @samp{ty}
3006 will be used as the typical name for instances of the class. (See the
3007 entry for @samp{MAYBE_<TYP>METH} below for an example using all three
3008 notations.)
3009
3010 In the interface (@file{.h} file), the following declarations are used
3011 often. Others may be used in particular modules. Since they're
3012 quite short in most cases, the definitions are given as well. The
3013 generic macros used are defined in @file{lisp.h} or @file{lrecord.h}.
3014
3015 @c #### reorganize this table into stuff used in general code, and stuff
3016 @c used only in declarations or initializations
3017 @table @samp
3018 @c #### declaration
3019 @item typedef struct Lisp_<Type> Lisp_<Type>
3020 This refers to the internal structure used by C code. The XEmacs coding
3021 style now forbids passing pointers to @samp{Lisp_<Type>} structures into
3022 or out of a function; instead, a @samp{Lisp_Object} should be passed or
3023 returned (created using @samp{wrap_<type>}, if necessary).
3024
3025 @c #### declaration
3026 @item DECLARE_LRECORD (<type>, Lisp_<Type>)
3027 Declares an @samp{lrecord} for @samp{<Type>}, which is the unit of
3028 allocation.
3029
3030 @item #define X<TYPE>(x) XRECORD (x, <type>, Lisp_<Type>)
3031 Turns a @code{Lisp_Object} into a pointer to @samp{struct Lisp_<Type>}.
3032
3033 @item #define wrap_<type>(p) wrap_record (p, <type>)
3034 Turns a pointer to @samp{struct Lisp_<Type>} into a @code{Lisp_Object}.
3035
3036 @item #define <TYPE>P(x) RECORDP (x, <type>)
3037 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>}.
3038 Returns a C int, not a Lisp Boolean value.
3039
3040 @item #define CHECK_<TYPE>(x) CHECK_RECORD (x, <type>)
3041 @itemx #define CONCHECK_<TYPE>(x) CONCHECK_RECORD (x, <type>)
3042 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>},
3043 and signals a Lisp error if not. The @samp{CHECK} version of the macro
3044 never returns if the type is wrong, while the @samp{CONCHECK} version
3045 can return if the user catches it in the debugger and explicitly
3046 requests a return.
3047
3048 @item #define RAW_<TYP>METH(ty, m) ((ty)->methods->m##_method)
3049 Return a function pointer for the method for an object @var{TY} of class
3050 @samp{Lisp_<Type>}, or @samp{NULL} if there is none for this type.
3051
3052 @item #define HAS_<TYP>METH_P(ty, m) (!!RAW_<TYP>METH (ty, m))
3053 Test whether the class that @var{TY} is an instance of has the method.
3054
3055 @item #define <TYP>METH(ty, m, args) ((RAW_<TYP>METH (ty, m)) args)
3056 Call the method on @samp{args}. @samp{args} must be enclosed in
3057 parentheses in the call. It is the programmer's responsibility to
3058 ensure that the method is available. The standard convenience macro
3059 @samp{MAYBE_<TYP>METH} is often provided for the common case where a
3060 void-returning method of @samp{<Type>} is called.
3061
3062 @item #define MAYBE_<TYP>METH(ty, m, args) do @{ ... @} while (0)
3063 Call a void-returning @samp{<Type>} method, if it exists. Note the use
3064 of the @samp{do ... while (0)} idiom to give the macro call C statement
3065 semantics. The full definition is equally idiomatic:
3066
3067 @example
3068 #define MAYBE_<TYP>METH(ty, m, args) do @{ \
3069 Lisp_<Type> *maybe_<typ>meth_ty = (ty); \
3070 if (HAS_<TYP>METH_P (maybe_<typ>meth_ty, m)) \
3071 <TYP>METH (maybe_<typ>meth_ty, m, args); \
3072 @} while (0)
3073 @end example
3074 @end table
3075
3076 The use of macros for invoking an object's methods makes life a bit
3077 difficult for the student or maintainer when browsing the code. In
3078 particular, calls are of the form @samp{<TYP>METH (ty, some_method, (x,
3079 y))}, but definitions typically are for @samp{<subtype>_some_method}.
3080 Thus, when you are trying to find calls, you need to grep for
3081 @samp{some_method}, but this will also catch calls and definitions of
3082 that method for instances of other subtypes of @samp{<Type>}, and there
3083 may be a rather large number of them.
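
To make the pattern concrete, here is a minimal stand-alone sketch that instantiates the method macros for a hypothetical class @samp{widget} (abbreviation @samp{WID}) with a hypothetical subtype @samp{box}. None of these names exist in XEmacs, and the sketch omits the @samp{lrecord} machinery entirely; it only shows how the method table and the macros fit together.

```c
#include <assert.h>
#include <stddef.h>

struct widget;

/* Method table shared by all instances of one subtype. */
struct widget_methods
{
  int (*area_method) (struct widget *w);
  void (*reset_method) (struct widget *w);      /* optional; may be NULL */
};

struct widget
{
  const struct widget_methods *methods;
  int width, height;
};

#define RAW_WIDMETH(w, m) ((w)->methods->m##_method)
#define HAS_WIDMETH_P(w, m) (!!RAW_WIDMETH (w, m))
#define WIDMETH(w, m, args) ((RAW_WIDMETH (w, m)) args)
#define MAYBE_WIDMETH(w, m, args) do {          \
  struct widget *maybe_widmeth_w = (w);         \
  if (HAS_WIDMETH_P (maybe_widmeth_w, m))       \
    WIDMETH (maybe_widmeth_w, m, args);         \
} while (0)

/* Methods of the hypothetical subtype "box". */
static int
box_area (struct widget *w)
{
  return w->width * w->height;
}

static void
box_reset (struct widget *w)
{
  w->width = w->height = 0;
}

static const struct widget_methods box_methods = { box_area, box_reset };
/* A subtype that does not implement the optional reset method. */
static const struct widget_methods plain_methods = { box_area, NULL };

static void
example_widget (void)
{
  struct widget box = { &box_methods, 2, 3 };
  struct widget plain = { &plain_methods, 4, 5 };

  assert (WIDMETH (&box, area, (&box)) == 6);
  MAYBE_WIDMETH (&box, reset, (&box));          /* calls box_reset */
  assert (box.width == 0 && box.height == 0);
  MAYBE_WIDMETH (&plain, reset, (&plain));      /* no method: no-op */
  assert (plain.width == 4 && plain.height == 5);
}
```

The sketch also shows the browsing problem just described: the call sites mention only @samp{area} and @samp{reset}, while the definitions are named @samp{box_area} and @samp{box_reset}.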
3084
3085
3086 @node Writing Lisp Primitives, Writing Good Comments, Object-Oriented Techniques for C, Rules When Writing New C Code
3087 @section Writing Lisp Primitives
3088 @cindex writing Lisp primitives
3089 @cindex Lisp primitives, writing
3090 @cindex primitives, writing Lisp
3091
3092 Lisp primitives are Lisp functions implemented in C. The details of
3093 interfacing the C function so that Lisp can call it are handled by a few
3094 C macros. The only way to really understand how to write new C code is
3095 to read the source, but we can explain some things here.
3096
3097 An example of a special form is the definition of @code{prog1}, from
3098 @file{eval.c}. (An ordinary function would have the same general
3099 appearance.)
3100
3101 @cindex garbage collection protection
3102 @smallexample
3103 @group
3104 DEFUN ("prog1", Fprog1, 1, UNEVALLED, 0, /*
3105 Similar to `progn', but the value of the first form is returned.
3106 \(prog1 FIRST BODY...): All the arguments are evaluated sequentially.
3107 The value of FIRST is saved during evaluation of the remaining args,
3108 whose values are discarded.
3109 */
3110 (args))
3111 @{
3112 /* This function can GC */
3113 REGISTER Lisp_Object val, form, tail;
3114 struct gcpro gcpro1;
3115
3116 val = Feval (XCAR (args));
3117
3118 GCPRO1 (val);
3119
3120 LIST_LOOP_3 (form, XCDR (args), tail)
3121 Feval (form);
3122
3123 UNGCPRO;
3124 return val;
3125 @}
3126 @end group
3127 @end smallexample
3128
3129 Let's start with a precise explanation of the arguments to the
3130 @code{DEFUN} macro. Here is a template for them:
3131
3132 @example
3133 @group
3134 DEFUN (@var{lname}, @var{fname}, @var{min_args}, @var{max_args}, @var{interactive}, /*
3135 @var{docstring}
3136 */
3137 (@var{arglist}))
3138 @end group
3139 @end example
3140
3141 @table @var
3142 @item lname
3143 This string is the name of the Lisp symbol to define as the function
3144 name; in the example above, it is @code{"prog1"}.
3145
3146 @item fname
3147 This is the C function name for this function. This is the name that is
3148 used in C code for calling the function. The name is, by convention,
3149 @samp{F} prepended to the Lisp name, with all dashes (@samp{-}) in the
3150 Lisp name changed to underscores. Thus, to call this function from C
3151 code, call @code{Fprog1}. Remember that the arguments are of type
3152 @code{Lisp_Object}; various macros and functions for creating values of
3153 type @code{Lisp_Object} are declared in the file @file{lisp.h}.
3154
3155 Primitives whose names are special characters (e.g. @code{+} or
3156 @code{<}) are named by spelling out, in some fashion, the special
3157 character: e.g. @code{Fplus()} or @code{Flss()}. Primitives whose names
3158 begin with normal alphanumeric characters but also contain special
3159 characters are spelled out in some creative way, e.g. @code{let*}
3160 becomes @code{FletX()}.
3161
3162 Each function also has an associated structure that holds the data for
3163 the subr object that represents the function in Lisp. This structure
3164 conveys the Lisp symbol name to the initialization routine that will
3165 create the symbol and store the subr object as its definition. The C
3166 variable name of this structure is always @samp{S} prepended to the
3167 @var{fname}. You hardly ever need to be aware of the existence of this
3168 structure, since @code{DEFUN} plus @code{DEFSUBR} takes care of all the
3169 details.
3170
3171 @item min_args
3172 This is the minimum number of arguments that the function requires. The
3173 function @code{prog1} allows a minimum of one argument.
3174
3175 @item max_args
3176 This is the maximum number of arguments that the function accepts, if
3177 there is a fixed maximum. Alternatively, it can be @code{UNEVALLED},
3178 indicating a special form that receives unevaluated arguments, or
3179 @code{MANY}, indicating an unlimited number of evaluated arguments (the
3180 C equivalent of @code{&rest}). Both @code{UNEVALLED} and @code{MANY}
3181 are macros. If @var{max_args} is a number, it may not be less than
3182 @var{min_args} and it may not be greater than 8. (If you need to add a
3183 function with more than 8 arguments, use the @code{MANY} form. Resist
3184 the urge to edit the definition of @code{DEFUN} in @file{lisp.h}. If
3185 you do it anyway, make sure to also add another clause to the switch
3186 statement in @code{primitive_funcall()}.)
3187
3188 @item interactive
3189 This is an interactive specification, a string such as might be used as
3190 the argument of @code{interactive} in a Lisp function. In the case of
3191 @code{prog1}, it is 0 (a null pointer), indicating that @code{prog1}
3192 cannot be called interactively. A value of @code{""} indicates a
3193 function that should receive no arguments when called interactively.
3194
3195 @item docstring
3196 This is the documentation string. It is written just like a
3197 documentation string for a function defined in Lisp; in particular, the
3198 first line should be a single sentence. Note how the documentation
3199 string is enclosed in a comment, none of the documentation is placed on
3200 the same lines as the comment-start and comment-end characters, and the
3201 comment-start characters are on the same line as the interactive
3202 specification. @file{make-docfile}, which scans the C files for
3203 documentation strings, is very particular about what it looks for, and
3204 will not properly extract the doc string if it's not in this exact format.
3205
3206 In order to make both @file{etags} and @file{make-docfile} happy, make
3207 sure that the @code{DEFUN} line contains the @var{lname} and
3208 @var{fname}, and that the comment-start characters for the doc string
3209 are on the same line as the interactive specification, and put a newline
3210 directly after them (and before the comment-end characters).
3211
3212 @item arglist
3213 This is the comma-separated list of arguments to the C function. For a
3214 function with a fixed maximum number of arguments, provide a C argument
3215 for each Lisp argument. In this case, unlike regular C functions, the
3216 types of the arguments are not declared; they are simply always of type
3217 @code{Lisp_Object}.
3218
3219 The names of the C arguments will be used as the names of the arguments
3220 to the Lisp primitive as displayed in its documentation, modulo the same
3221 concerns described above for @code{F...} names (in particular,
3222 underscores in the C arguments become dashes in the Lisp arguments).
3223
3224 There is one additional kludge: A trailing @samp{_} on the C argument is
3225 discarded when forming the Lisp argument. This allows C language
3226 reserved words (like @code{default}) or global symbols (like
3227 @code{dirname}) to be used as argument names without compiler warnings
3228 or errors.
3229
3230 A Lisp function with @w{@var{max_args} = @code{UNEVALLED}} is a
3231 @w{@dfn{special form}}; its arguments are not evaluated. Instead it
3232 receives one argument of type @code{Lisp_Object}, a (Lisp) list of the
3233 unevaluated arguments, conventionally named @code{(args)}.
3234
3235 When a Lisp function has no upper limit on the number of arguments,
3236 specify @w{@var{max_args} = @code{MANY}}. In this case its implementation in
3237 C actually receives exactly two arguments: the number of Lisp arguments
3238 (an @code{int}) and the address of a block containing their values (a
3239 @w{@code{Lisp_Object *}}). In this case only are the C types specified
3240 in the @var{arglist}: @w{@code{(int nargs, Lisp_Object *args)}}.
3241
3242 @end table
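
The mapping from C argument names to Lisp argument names described under @var{arglist} can be written down as a small function. The helper below is hypothetical and is not the code that @file{make-docfile} actually uses; it merely restates the two rules given above (a trailing @samp{_} is discarded, and the remaining underscores become dashes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of XEmacs: convert the C name of a
   DEFUN argument into the corresponding Lisp argument name.  Writes
   at most SIZE bytes, including the terminating NUL, to OUT. */
static void
c_arg_to_lisp_name (const char *c_name, char *out, size_t size)
{
  size_t len = strlen (c_name);
  size_t i;

  if (size == 0)
    return;
  /* A single trailing underscore is discarded. */
  if (len > 0 && c_name[len - 1] == '_')
    len--;
  /* Truncate if the output buffer is too small. */
  if (len > size - 1)
    len = size - 1;
  /* The remaining underscores become dashes. */
  for (i = 0; i < len; i++)
    out[i] = (c_name[i] == '_') ? '-' : c_name[i];
  out[len] = '\0';
}
```

With this helper, the C argument @code{default_} corresponds to the Lisp argument @code{default}, and a C argument such as @code{buffer_name} corresponds to @code{buffer-name}.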
3243
3244 Within the function @code{Fprog1} itself, note the use of the macros
3245 @code{GCPRO1} and @code{UNGCPRO}. @code{GCPRO1} is used to ``protect''
3246 a variable from garbage collection---to inform the garbage collector
3247 that it must look in that variable and regard the object pointed at by
3248 its contents as an accessible object. This is necessary whenever you
3249 call @code{Feval} or anything that can directly or indirectly call
3250 @code{Feval} (this includes the @code{QUIT} macro!). At such a time,
3251 any Lisp object that you intend to refer to again must be protected
3252 somehow. @code{UNGCPRO} cancels the protection of the variables that
3253 are protected in the current function. It is necessary to do this
3254 explicitly.
3255
3256 The macro @code{GCPRO1} protects just one local variable. If you want
3257 to protect two, use @code{GCPRO2} instead; repeating @code{GCPRO1} will
3258 not work. Macros @code{GCPRO3} and @code{GCPRO4} also exist.
3259
3260 These macros implicitly use local variables such as @code{gcpro1}; you
3261 must declare these explicitly, with type @code{struct gcpro}. Thus, if
3262 you use @code{GCPRO2}, you must declare @code{gcpro1} and @code{gcpro2}.
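
The following simplified model may help explain these rules. It assumes the traditional Emacs design in which the protected variables form a chain of stack-allocated @code{struct gcpro} records headed by a global pointer; the actual definitions in @file{lisp.h} differ in detail (for example, in debugging builds), and @code{Lisp_Object} is replaced here by a stand-in type. The helper functions @code{count_protected} and @code{protected_while_working} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef long Lisp_Object;       /* stand-in; the real type differs */

/* One link in the chain of protected variables. */
struct gcpro
{
  struct gcpro *next;
  Lisp_Object *var;             /* address of the protected variable */
  int nvars;                    /* number of consecutive variables */
};

/* Head of the chain; a collector would walk it when marking. */
static struct gcpro *gcprolist;

#define GCPRO1(v) \
  (gcpro1.next = gcprolist, gcpro1.var = &(v), gcpro1.nvars = 1, \
   gcprolist = &gcpro1)

#define GCPRO2(v1, v2) \
  (gcpro1.next = gcprolist, gcpro1.var = &(v1), gcpro1.nvars = 1, \
   gcpro2.next = &gcpro1,   gcpro2.var = &(v2), gcpro2.nvars = 1, \
   gcprolist = &gcpro2)

/* Pops every link pushed by the current function at once. */
#define UNGCPRO (gcprolist = gcpro1.next)

/* Count the variables a collector would find on the chain. */
static int
count_protected (void)
{
  int n = 0;
  struct gcpro *p;
  for (p = gcprolist; p; p = p->next)
    n += p->nvars;
  return n;
}

static int
protected_while_working (void)
{
  Lisp_Object a = 1, b = 2;
  struct gcpro gcpro1, gcpro2;  /* required by GCPRO2 */
  int seen;

  GCPRO2 (a, b);
  seen = count_protected ();    /* both a and b are on the chain */
  UNGCPRO;
  return seen;
}
```

In this model it is easy to see why repeating @code{GCPRO1} does not work: the second use would overwrite the same @code{gcpro1} record, dropping the first variable from the chain and making @code{gcpro1.next} point at @code{gcpro1} itself.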
3263
3264 @cindex caller-protects (@code{GCPRO} rule)
3265 Note also that the general rule is @dfn{caller-protects}; i.e. you are
3266 only responsible for protecting those Lisp objects that you create. Any
3267 objects passed to you as arguments should have been protected by whoever
3268 created them, so you don't in general have to protect them.
3269
3270 In particular, the arguments to any Lisp primitive are always
3271 automatically @code{GCPRO}ed, when called ``normally'' from Lisp code or
3272 bytecode. So only a few Lisp primitives that are called frequently from
3273 C code, such as @code{Fprogn}, protect their arguments as a service to
3274 their caller. You don't need to protect your arguments when writing a
3275 new @code{DEFUN}.
3276
3277 @code{GCPRO}ing is perhaps the trickiest and most error-prone part of
3278 XEmacs coding. It is @strong{extremely} important that you get this
3279 right and use a great deal of discipline when writing this code.
3280 @xref{GCPROing, ,@code{GCPRO}ing}, for full details on how to do this.
3281
3282 What @code{DEFUN} actually does is declare a global structure of type
3283 @code{Lisp_Subr} whose name begins with capital @samp{SF} and which
3284 contains information about the primitive (e.g. a pointer to the
3285 function, its minimum and maximum allowed arguments, a string describing
3286 its Lisp name); @code{DEFUN} then begins a normal C function declaration
3287 using the @code{F...} name. The Lisp subr object that is the function
3288 definition of a primitive (i.e. the object in the function slot of the
3289 symbol that names the primitive) actually points to this @samp{SF}
3290 structure; when @code{Feval} encounters a subr, it looks in the
3291 structure to find out how to call the C function.
3292
3293 Defining the C function is not enough to make a Lisp primitive
3294 available; you must also create the Lisp symbol for the primitive (the
3295 symbol is @dfn{interned}; @pxref{Obarrays}) and store a suitable subr
3296 object in its function cell. (If you don't do this, the primitive won't
3297 be seen by Lisp code.) The code looks like this:
3298
3299 @example
3300 DEFSUBR (@var{fname});
3301 @end example
3302
3303 @noindent
3304 Here @var{fname} is the same name you used as the second argument to
3305 @code{DEFUN}.
3306
3307 This call to @code{DEFSUBR} should go in the @code{syms_of_*()} function
3308 at the end of the module. If no such function exists, create it and
3309 make sure to also declare it in @file{symsinit.h} and call it from the
3310 appropriate spot in @code{main()}. @xref{General Coding Rules}.
3311
3312 Note that C code cannot call functions by name unless they are defined
3313 in C. The way to call a function written in Lisp from C is to use
3314 @code{Ffuncall}, which embodies the Lisp function @code{funcall}. Since
3315 the Lisp function @code{funcall} accepts an unlimited number of
3316 arguments, in C it takes two: the number of Lisp-level arguments, and a
3317 one-dimensional array containing their values. The first Lisp-level
3318 argument is the Lisp function to call, and the rest are the arguments to
3319 pass to it. Since @code{Ffuncall} can call the evaluator, you must
3320 protect pointers from garbage collection around the call to
3321 @code{Ffuncall}. (However, @code{Ffuncall} explicitly protects all of
3322 its parameters, so you don't have to protect any pointers passed as
3323 parameters to it.)
3324
3325 The C functions @code{call0}, @code{call1}, @code{call2}, and so on,
3326 provide handy ways to call a Lisp function conveniently with a fixed
3327 number of arguments. They work by calling @code{Ffuncall}.
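
The calling convention can be modelled in a few lines of stand-alone C. Everything in this sketch is hypothetical: @code{Lisp_Object} is a stand-in integer type, a ``function'' is just an index into a small table, and @code{toy_funcall}, @code{toy_call1} and @code{toy_call2} only mimic the shape of @code{Ffuncall}, @code{call1} and @code{call2}. Garbage-collection protection is left out.

```c
#include <assert.h>

typedef long Lisp_Object;       /* stand-in; the real type differs */

/* Functions taking (nargs, args), as with max_args = MANY. */
typedef Lisp_Object (*toy_function) (int nargs, Lisp_Object *args);

static Lisp_Object
toy_add (int nargs, Lisp_Object *args)
{
  Lisp_Object sum = 0;
  int i;
  for (i = 0; i < nargs; i++)
    sum += args[i];
  return sum;
}

static Lisp_Object
toy_count (int nargs, Lisp_Object *args)
{
  (void) args;                  /* only the count is of interest */
  return nargs;
}

/* Toy function table; a Lisp_Object names a function by its index. */
enum { TOY_ADD, TOY_COUNT, TOY_NFUNCTIONS };
static const toy_function toy_functions[TOY_NFUNCTIONS] =
  { toy_add, toy_count };

/* Model of the funcall convention: args[0] is the function, and
   args[1] .. args[nargs - 1] are the arguments passed to it. */
static Lisp_Object
toy_funcall (int nargs, Lisp_Object *args)
{
  assert (nargs >= 1 && args[0] >= 0 && args[0] < TOY_NFUNCTIONS);
  return toy_functions[args[0]] (nargs - 1, args + 1);
}

/* Fixed-arity wrappers: pack the arguments into an array and hand
   them to toy_funcall. */
static Lisp_Object
toy_call1 (Lisp_Object fn, Lisp_Object arg0)
{
  Lisp_Object args[2];
  args[0] = fn;
  args[1] = arg0;
  return toy_funcall (2, args);
}

static Lisp_Object
toy_call2 (Lisp_Object fn, Lisp_Object arg0, Lisp_Object arg1)
{
  Lisp_Object args[3];
  args[0] = fn;
  args[1] = arg0;
  args[2] = arg1;
  return toy_funcall (3, args);
}
```

For example, @code{toy_call2 (TOY_ADD, 20, 22)} builds a three-element array whose first element selects the function, and the callee sees only the remaining two elements.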
3328
3329 @file{eval.c} is a very good file to look through for examples;
3330 @file{lisp.h} contains the definitions for important macros and
3331 functions.
3332
3333 @node Writing Good Comments, Adding Global Lisp Variables, Writing Lisp Primitives, Rules When Writing New C Code
3334 @section Writing Good Comments
3335 @cindex writing good comments
3336 @cindex comments, writing good
3337
3338 Comments are a lifeline for programmers trying to understand tricky
3339 code. In general, the less obvious it is what you are doing, the more
3340 you need a comment, and the more detailed it needs to be. You should
3341 always be on guard when you're writing code for stuff that's tricky, and
3342 should constantly be putting yourself in someone else's shoes and asking
3343 if that person could figure out without much difficulty what's going
3344 on. (Assume they are a competent programmer who understands the
3345 essentials of how the XEmacs code is structured but doesn't know much
3346 about the module you're working on or any algorithms you're using.) If
3347 you're not sure whether they would be able to, add a comment. Always
3348 err on the side of more comments, rather than fewer.
3349
3350 Generally, when making comments, there is no need to attribute them with
3351 your name or initials. This especially goes for small,
3352 easy-to-understand, non-opinionated ones. Also, comments indicating
3353 where, when, and by whom a file was changed are @emph{strongly}
3354 discouraged, and in general will be removed as they are discovered.
3355 This is exactly what @file{ChangeLogs} are there for. However, it can
3356 occasionally be useful to mark exactly where (but not when or by whom)
3357 changes are made, particularly when making small changes to a file
3358 imported from elsewhere. These marks help when later on a newer version
3359 of the file is imported and the changes need to be merged. (If
3360 everything were always kept in CVS, there would be no need for this.
3361 But in practice, this often doesn't happen, or the CVS repository is
3362 later on lost or unavailable to the person doing the update.)
3363
3364 When putting in an explicit opinion in a comment, you should
3365 @emph{always} attribute it with your name and the date. This also goes
3366 for long, complex comments explaining in detail the workings of
3367 something -- by putting your name there, you make it possible for
3368 someone who has questions about how that thing works to determine who
3369 wrote the comment so they can write to them. Use your actual name or
3370 your alias at xemacs.org, and not your initials or nickname, unless that
3371 is generally recognized (e.g. @samp{jwz}). Even then, please consider
3372 requesting a virtual user at xemacs.org (forwarding address; we can't
3373 provide an actual mailbox). Otherwise, give first and last name. If
3374 you're not a regular contributor, you might consider putting your email
3375 address in -- it may be in the ChangeLog, but after a while ChangeLogs
3376 tend to disappear or get muddled. (E.g. your comment
3377 may get copied somewhere else or even into another program, and tracking
3378 down the proper ChangeLog may be very difficult.)
3379
3380 If you come across an opinion that is not or is no longer valid, or you
3381 come across any comment that no longer applies but you want to keep it
3382 around, enclose it in @samp{[[ } and @samp{ ]]} marks and add a comment
3383 afterwards explaining why the preceding comment is no longer valid. Put
3384 your name on this comment, as explained above.
3385
3386 Just as comments are a lifeline to programmers, incorrect comments are
3387 death. If you come across an incorrect comment, @strong{immediately}
3388 correct it or flag it as incorrect, as described in the previous
3389 paragraph. Whenever you work on a section of code, @emph{always} make
3390 sure to update any comments to be correct -- or, at the very least, flag
3391 them as incorrect.
3392
3393 To indicate a ``todo'' or other problem, use four pound signs --
3394 i.e. @samp{####}.
3395
3396 @node Adding Global Lisp Variables, Writing Macros, Writing Good Comments, Rules When Writing New C Code
3397 @section Adding Global Lisp Variables
3398 @cindex global Lisp variables, adding
3399 @cindex variables, adding global Lisp
3400
3401 Global variables whose names begin with @samp{Q} are constants whose
3402 value is a symbol of a particular name. The name of the variable should
3403 be derived from the name of the symbol using the same rules as for Lisp
3404 primitives. These variables are initialized using a call to
3405 @code{defsymbol()} in the @code{syms_of_*()} function. (This call
3406 interns a symbol, sets the C variable to the resulting Lisp object, and
3407 calls @code{staticpro()} on the C variable to tell the
3408 garbage-collection mechanism about this variable. What
3409 @code{staticpro()} does is add a pointer to the variable to a large
3410 global array; when garbage-collection happens, all pointers listed in
3411 the array are used as starting points for marking Lisp objects. This is
3412 important because it's quite possible that the only current reference to
3413 the object is the C variable. In the case of symbols, the
3414 @code{staticpro()} doesn't matter all that much because the symbol is
3415 contained in @code{obarray}, which is itself @code{staticpro()}ed.
3416 However, it's possible that a naughty user could do something like
3417 uninterning the symbol out of @code{obarray} or even setting
3418 @code{obarray} to a different value [although this is likely to make
3419 XEmacs crash!].)
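
The idea behind @code{staticpro()} can be sketched as follows. This is a toy model, not the XEmacs implementation: the names carry a @samp{toy_} prefix to mark them as hypothetical, the array has an arbitrary fixed size, @code{Lisp_Object} is a stand-in type, and ``visiting'' a root just adds up the current values so that the effect can be observed.

```c
#include <assert.h>
#include <stddef.h>

typedef long Lisp_Object;       /* stand-in; the real type differs */

#define TOY_STATICPRO_MAX 16    /* arbitrary limit for this sketch */

/* The array holds the addresses of the protected C variables, not
   their current values, so later assignments are seen automatically. */
static Lisp_Object *toy_staticvec[TOY_STATICPRO_MAX];
static int toy_staticidx;

static void
toy_staticpro (Lisp_Object *varaddress)
{
  assert (toy_staticidx < TOY_STATICPRO_MAX);
  toy_staticvec[toy_staticidx++] = varaddress;
}

/* Stand-in for the start of the mark phase: visit every root.  Here
   we merely sum the current values of the registered variables. */
static Lisp_Object
toy_visit_roots (void)
{
  Lisp_Object sum = 0;
  int i;
  for (i = 0; i < toy_staticidx; i++)
    sum += *toy_staticvec[i];
  return sum;
}

/* Hypothetical global in the style of a Q... variable. */
static Lisp_Object Qtoy;
```

Because the variable is registered by address, it can be registered once during initialization and assigned afterwards; whatever value it holds when a collection starts is the one that is used as a root.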
3420
3421 @strong{Please note:} It is potentially deadly if you declare a
3422 @samp{Q...} variable in two different modules. The two calls to
3423 @code{defsymbol()} are no problem, but some linkers will complain about
3424 multiply-defined symbols. The most insidious aspect of this is that
3425 often the link will succeed anyway, but then the resulting executable
3426 will sometimes crash in obscure ways during certain operations!
3427
3428 To avoid this problem, declare any symbols with common names (such as
3429 @code{text}) that are not obviously associated with this particular
3430 module in the file @file{general-slots.h}. The ``-slots'' suffix
3431 indicates that this is a file that is included multiple times in
3432 @file{general.c}. Redefinition of preprocessor macros allows the
3433 effects to be different in each context, so this is actually more
3434 convenient and less error-prone than doing it in your module.
3435
3436 Global variables whose names begin with @samp{V} are variables that
3437 contain Lisp objects. The convention here is that all global variables
3438 of type @code{Lisp_Object} begin with @samp{V}, and all others don't
3439 (including integer and boolean variables that have Lisp
3440 equivalents). Most of the time, these variables have equivalents in
3441 Lisp, but some don't. Those that do are declared this way by a call to
3442 @code{DEFVAR_LISP()} in the @code{vars_of_*()} initializer for the
3443 module. What this does is create a special @dfn{symbol-value-forward}
3444 Lisp object that contains a pointer to the C variable, intern a symbol
3445 whose name is as specified in the call to @code{DEFVAR_LISP()}, and set
3446 its value to the symbol-value-forward Lisp object; it also calls
3447 @code{staticpro()} on the C variable to tell the garbage-collection
3448 mechanism about the variable. When @code{eval} (or actually
3449 @code{symbol-value}) encounters this special object in the process of
3450 retrieving a variable's value, it follows the indirection to the C
3451 variable and gets its value. @code{setq} does similar things so that
3452 the C variable gets changed.
3453
3454 Whether or not you @code{DEFVAR_LISP()} a variable, you need to
3455 initialize it in the @code{vars_of_*()} function; otherwise it will end
3456 up as all zeroes, which is the integer 0 (@emph{not} @code{nil}), and
3457 this is probably not what you want. Also, if the variable is not
3458 @code{DEFVAR_LISP()}ed, @strong{you must call} @code{staticpro()} on the
3459 C variable in the @code{vars_of_*()} function. Otherwise, the
3460 garbage-collection mechanism won't know that the object in this variable
3461 is in use, and will happily collect it and reuse its storage for another
3462 Lisp object, and you will be the one who's unhappy when you can't figure
3463 out how your variable got overwritten.
3464
3465 @node Writing Macros, Proper Use of Unsigned Types, Adding Global Lisp Variables, Rules When Writing New C Code
3466 @section Writing Macros
3467 @cindex writing macros
3468 @cindex macros, writing
3469
3470 The three golden rules of macros:
3471
3472 @enumerate
3473 @item
3474 Anything that's an lvalue can be evaluated more than once.
3475 @item
3476 Macros where anything else can be evaluated more than once should
3477 have the word "unsafe" in their name (exceptions may be made for
3478 large sets of macros that evaluate arguments of certain types more
3479 than once, e.g. struct buffer * arguments, when clearly indicated in
3480 the macro documentation). These macros are generally meant to be
3481 called only by other macros that have already stored the calling
3482 values in temporary variables.
3483 @item
3484 Nothing else can be evaluated more than once. Use inline
3485 functions, if necessary, to prevent multiple evaluation.
3486 @end enumerate
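The three rules can be sketched in a few lines of standalone C. The names below (CLAMP_TO_ZERO, SQUARE_UNSAFE, square, next_value) are hypothetical and chosen only for illustration; they are not taken from the XEmacs sources.

```c
#include <assert.h>

/* Rule 1: VAR is an lvalue, so evaluating it more than once is acceptable. */
#define CLAMP_TO_ZERO(var) do { if ((var) < 0) (var) = 0; } while (0)

/* Rule 2: X is an arbitrary expression and is evaluated twice,
   so the macro carries "UNSAFE" in its name. */
#define SQUARE_UNSAFE(x) ((x) * (x))

/* Rule 3: an inline function evaluates its argument exactly once. */
static inline int
square (int x)
{
  return x * x;
}

/* An argument expression with a side effect: it counts its evaluations. */
static int evaluations;

static int
next_value (void)
{
  evaluations++;
  return 3;
}

/* Number of times the argument is evaluated by the unsafe macro. */
static int
count_evaluations_unsafe (void)
{
  evaluations = 0;
  int result = SQUARE_UNSAFE (next_value ());
  assert (result == 9);
  return evaluations;
}

/* Number of times the argument is evaluated by the inline function. */
static int
count_evaluations_inline (void)
{
  evaluations = 0;
  int result = square (next_value ());
  assert (result == 9);
  return evaluations;
}

/* Rule 1 in use: the lvalue argument is read and possibly written. */
static int
clamp_demo (int value)
{
  CLAMP_TO_ZERO (value);
  return value;
}
```

Both squaring variants compute the same result, but only the inline function keeps the side effect of its argument from happening twice.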
3487
3488 NOTE: The functions and macros below are given full prototypes in their
3489 docs, even when the implementation is a macro. In such cases, passing
3490 an argument of a type other than expected will produce undefined
3491 results. Also, given that macros can do things functions can't (in
3492 particular, directly modify arguments as if they were passed by
3493 reference), the declaration syntax has been extended to include the
3494 call-by-reference syntax from C++, where an & after a type indicates
3495 that the argument is an lvalue and is passed by reference, i.e. the
3496 function can modify its value. (This is equivalent in C to passing a
3497 pointer to the argument, but without the need to explicitly worry about
3498 pointers.)
3499
3500 When to capitalize macros:
3501
3502 @itemize @bullet
3503 @item
3504 Capitalize macros doing stuff obviously impossible with (C)
3505 functions, e.g. directly modifying arguments as if they were passed by
3506 reference.
3507 @item
3508 Capitalize macros that evaluate @strong{any} argument more than once regardless
3509 of whether that's "allowed" (e.g. buffer arguments).
3510 @item
3511 Capitalize macros that directly access a field in a Lisp_Object or
3512 its equivalent underlying structure. In such cases, access through the
3513 Lisp_Object precedes the macro with an X, and access through the underlying
3514 structure doesn't.
3515 @item
3516 Capitalize certain other basic macros relating to Lisp_Objects; e.g.
3517 FRAMEP, CHECK_FRAME, etc.
3518 @item
3519 Try to avoid capitalizing any other macros.
3520 @end itemize
3521
3522 @node Proper Use of Unsigned Types, Techniques for XEmacs Developers, Writing Macros, Rules When Writing New C Code
3523 @section Proper Use of Unsigned Types
3524 @cindex unsigned types, proper use of
3525 @cindex types, proper use of unsigned
3526
3527 Avoid using @code{unsigned int} and @code{unsigned long} whenever
3528 possible. Unsigned types are viral -- any arithmetic or comparisons
3529 involving mixed signed and unsigned types are automatically converted to
3530 unsigned, which is almost certainly not what you want. Many subtle and
3531 hard-to-find bugs are created by careless use of unsigned types. In
3532 general, you should almost @emph{never} use an unsigned type to hold a
3533 regular quantity of any sort. The only exceptions are
3534
3535 @enumerate
3536 @item
3537 When there's a reasonable possibility you will actually need all 32 or
3538 64 bits to store the quantity.
3539 @item
3540 When calling existing APIs that require unsigned types. In this case,
3541 you should still do all manipulation using signed types, and do the
3542 conversion at the very threshold of the API call.
3543 @item
3544 In existing code that you don't want to modify because you don't
3545 maintain it.
3546 @item
3547 In bit-field structures.
3548 @end enumerate
3549
3550 Other reasonable uses of @code{unsigned int} and @code{unsigned long}
3551 are representing non-quantities -- e.g. bit-oriented flags and such.
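A small standalone sketch of why mixing is dangerous, and of doing the conversion at the threshold of an API call. The function names are hypothetical; only @code{strlen()} is a real library call.

```c
#include <assert.h>
#include <string.h>

/* Mixed comparison: A is converted to unsigned int before comparing,
   so a negative A becomes a huge positive value.  Compilers typically
   flag this line under -Wsign-compare. */
static int
mixed_less_than (int a, unsigned int b)
{
  return a < b;
}

/* The same comparison with both operands signed behaves as expected. */
static int
signed_less_than (int a, int b)
{
  return a < b;
}

/* strlen() returns the unsigned type size_t; convert once, right at
   the API boundary, and keep all further arithmetic signed. */
static long
text_length (const char *s)
{
  return (long) strlen (s);
}

/* Signed arithmetic can report a shortfall as a negative number.
   Done in size_t, the subtraction would wrap around to a huge value. */
static long
bytes_left (long capacity, const char *s)
{
  return capacity - text_length (s);
}
```

With these definitions, @code{mixed_less_than (-1, 1)} is false even though -1 is less than 1, while the all-signed version gives the intuitive answer.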
3552
3553 @node Techniques for XEmacs Developers, , Proper Use of Unsigned Types, Rules When Writing New C Code
3554 @section Techniques for XEmacs Developers
3555 @cindex techniques for XEmacs developers
3556 @cindex developers, techniques for XEmacs
3557
3558 @cindex Purify
3559 @cindex Quantify
3560 To make a purified XEmacs, do: @code{make puremacs}.
3561 To make a quantified XEmacs, do: @code{make quantmacs}.
3562
3563 You simply can't dump Quantified and Purified images (unless using the
3564 portable dumper). Purify gets confused when xemacs frees memory in one
3565 process that was allocated in a @emph{different} process on a different
3566 machine! Run it like so:
3567 @example
3568 temacs -batch -l loadup.el run-temacs @var{xemacs-args...}
3569 @end example
3570
3571 @cindex error checking
3572 Before you go through the trouble, are you compiling with all
3573 debugging and error-checking off? If not, try that first. Be warned
3574 that while Quantify is directly responsible for quite a few
3575 optimizations which have been made to XEmacs, doing a run which
3576 generates results which can be acted upon is not necessarily a trivial
3577 task.
3578
3579 Also, if you're still willing to do some runs make sure you configure
3580 with the @samp{--quantify} flag. That will keep Quantify from starting
3581 to record data until after the loadup is completed and will shut off
3582 recording right before it shuts down (which generates enough bogus data
3583 to throw most results off). It also enables three additional elisp
3584 commands: @code{quantify-start-recording-data},
3585 @code{quantify-stop-recording-data} and @code{quantify-clear-data}.
3586
3587 If you want to make XEmacs faster, target your favorite slow benchmark,
3588 run a profiler like Quantify, @code{gprof}, or @code{tcov}, and figure
3589 out where the cycles are going. In many cases you can localize the
3590 problem (because a particular new feature or even a single patch
3591 elicited it). Don't hesitate to use brute force techniques like a
3592 global counter incremented at strategic places, especially in
3593 combination with other performance indications (@emph{e.g.}, degree of
3594 buffer fragmentation into extents).
3595
3596 Specific projects:
3597
3598 @itemize @bullet
3599 @item
3600 Make the garbage collector faster. Figure out how to write an
3601 incremental garbage collector.
3602 @item
3603 Write a compiler that takes bytecode and spits out C code.
3604 Unfortunately, you will then need a C compiler and a more fully
3605 developed module system.
3606 @item
3607 Speed up redisplay.
3608 @item
3609 Speed up syntax highlighting. It was suggested that ``maybe moving some
3610 of the syntax highlighting capabilities into C would make a
3611 difference.'' Wrong idea, I think. When processing one 400kB file a
3612 particular low-level routine was being called 40 @emph{million} times
3613 simply for @emph{one} call to @code{newline-and-indent}. Syntax
3614 highlighting needs to be rewritten to use a reliable, fast parser, then
3615 to trust the pre-parsed structure, and only do re-highlighting locally
3616 to a text change. Modern machines are fast enough to implement such
3617 parsers in Lisp; but no machine will ever be fast enough to deal with
3618 quadratic (or worse) algorithms!
3619 @item
3620 Implement tail recursion in Emacs Lisp (hard!).
3621 @end itemize
3622
3623 Unfortunately, Emacs Lisp is slow, and is going to stay slow. Function
3624 calls in elisp are especially expensive. Iterating over a long list is
3625 going to be 30 times faster implemented in C than in Elisp.
3626
3627 Heavily used small code fragments need to be fast. The traditional way
3628 to implement such code fragments in C is with macros. But macros in C
3629 are known to be broken.
3630
3631 @cindex macro hygiene
3632 Macro arguments that are repeatedly evaluated may suffer from repeated
3633 side effects or suboptimal performance.
3634
3635 Variable names used in macros may collide with caller's variables,
3636 causing (at least) unwanted compiler warnings.
3637
3638 In order to solve these problems, and maintain statement semantics, one
3639 should use the @code{do @{ ... @} while (0)} trick while trying to
3640 reference macro arguments exactly once using local variables.
3641
3642 Let's take a look at this poor macro definition:
3643
3644 @example
3645 #define MARK_OBJECT(obj) \
3646 if (!marked_p (obj)) mark_object (obj), did_mark = 1
3647 @end example
3648
3649 This macro evaluates its argument twice, and also fails if used like this, because the @code{else} binds to the @code{if} inside the macro:
3650 @example
3651 if (flag) MARK_OBJECT (obj); else do_something ();
3652 @end example
3653
3654 A much better definition is
3655
3656 @example
3657 #define MARK_OBJECT(obj) do @{ \
3658 Lisp_Object mo_obj = (obj); \
3659 if (!marked_p (mo_obj)) \
3660 @{ \
3661 mark_object (mo_obj); \
3662 did_mark = 1; \
3663 @} \
3664 @} while (0)
3665 @end example
3666
3667 Notice the elimination of double evaluation by using the local variable
3668 with the obscure name. Writing safe and efficient macros requires great
3669 care. The one problem with macros that cannot be portably worked around
3670 is, since a C block has no value, a macro used as an expression rather
3671 than a statement cannot use the techniques just described to avoid
3672 multiple evaluation.
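To see the improved definition at work outside XEmacs, here is a minimal sketch that compiles on its own. The typedef for Lisp_Object, the array-based marked_p and mark_object, and the helpers fetch_object and mark_if are stand-ins invented for this example; the real definitions are quite different.

```c
#include <assert.h>

/* Stand-ins for the real XEmacs definitions (assumptions for this sketch). */
typedef int Lisp_Object;
static int marked[4];
static int did_mark;

static int
marked_p (Lisp_Object obj)
{
  return marked[obj];
}

static void
mark_object (Lisp_Object obj)
{
  marked[obj] = 1;
}

/* The safe definition: the argument is evaluated once into a local. */
#define MARK_OBJECT(obj) do {                   \
  Lisp_Object mo_obj = (obj);                   \
  if (!marked_p (mo_obj))                       \
    {                                           \
      mark_object (mo_obj);                     \
      did_mark = 1;                             \
    }                                           \
} while (0)

/* An argument with a side effect: counts how often it is evaluated. */
static int fetches;

static Lisp_Object
fetch_object (void)
{
  fetches++;
  return 2;
}

/* Uses the macro as a single statement in an if/else; returns the
   number of times the macro argument was evaluated. */
static int
mark_if (int flag)
{
  fetches = 0;
  if (flag)
    MARK_OBJECT (fetch_object ());
  else
    did_mark = -1;
  return fetches;
}
```

Because the expansion is one statement, the trailing semicolon and the following @code{else} parse as the caller intended, and the argument is fetched only once.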
3673
3674 @cindex inline functions
3675 In most cases where a macro has function semantics, an inline function
3676 is a better implementation technique. Modern compiler optimizers tend
3677 to inline functions even if they have no @code{inline} keyword, and
3678 configure magic ensures that the @code{inline} keyword can be safely
3679 used as an additional compiler hint. Inline functions used in a single
3680 .c file are easy. The function must already be defined to be
3681 @code{static}. Just add another @code{inline} keyword to the
3682 definition.
3683
3684 @example
3685 inline static int
3686 heavily_used_small_function (int arg)
3687 @{
3688 ...
3689 @}
3690 @end example
3691
3692 Inline functions in header files are trickier, because we would like to
3693 make the following optimization if the function is @emph{not} inlined
3694 (for example, because we're compiling for debugging). We would like the
3695 function to be defined externally exactly once, and each calling
3696 translation unit would create an external reference to the function,
3697 instead of including a definition of the inline function in the object
3698 code of every translation unit that uses it. This optimization is
3699 currently only available for gcc. But you don't have to worry about the
3700 trickiness; just define your inline functions in header files using this
3701 pattern:
3702
3703 @example
3704 INLINE_HEADER int
3705 i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg);
3706 INLINE_HEADER int
3707 i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg)
3708 @{
3709 ...
3710 @}
3711 @end example
3712
3713 The declaration right before the definition is to prevent warnings when
3714 compiling with @code{gcc -Wmissing-declarations}. I consider issuing
3715 this warning for inline functions a gcc bug, but the gcc maintainers disagree.
3716
3717 @cindex inline functions, headers
3718 @cindex header files, inline functions
3719 Every header which contains inline functions, either directly by using
3720 @code{INLINE_HEADER} or indirectly by using @code{DECLARE_LRECORD} must
3721 be added to @file{inline.c}'s includes to make the optimization
3722 described above work. (Optimization note: if all INLINE_HEADER
3723 functions are in fact inlined in all translation units, then the linker
3724 can just discard @code{inline.o}, since it contains only unreferenced code).
3725
3726 To get started debugging XEmacs, take a look at the @file{.gdbinit} and
3727 @file{.dbxrc} files in the @file{src} directory. See the section in the
3728 XEmacs FAQ on How to Debug an XEmacs problem with a debugger.
3729
3730 After making source code changes, run @code{make check} to ensure that
3731 you haven't introduced any regressions. If you want to make XEmacs more
3732 reliable, please improve the test suite in @file{tests/automated}.
3733
3734 Did you make sure you didn't introduce any new compiler warnings?
3735
3736 Before submitting a patch, please try compiling at least once with
3737
3738 @example
3739 configure --with-mule --use-union-type --error-checking=all
3740 @end example
3741
3742 Here are things to know when you create a new source file:
3743
3744 @itemize @bullet
3745 @item
3746 All @file{.c} files should @code{#include <config.h>} first. Almost all
3747 @file{.c} files should @code{#include "lisp.h"} second.
3748
3749 @item
3750 Generated header files should be included using the @samp{#include <...>}
3751 syntax, not the @samp{#include "..."} syntax. The generated headers are:
3752
3753 @file{config.h sheap-adjust.h paths.h Emacs.ad.h}
3754
3755 The basic rule is that you should assume builds using @samp{--srcdir},
3756 and that the @samp{#include <...>} syntax needs to be used when the
3757 to-be-included generated file is in a potentially different directory
3758 @emph{at compile time}. The non-obvious C rule is that
3759 @samp{#include "..."} means to search for the included file in the same
3760 directory as the including file, @emph{not} in the current directory.
3761 Normally this is not a problem but when building with @samp{--srcdir},
3762 @file{make} will search the @samp{VPATH} for you, while the C compiler
3763 knows nothing about it.
3764
3765 @item
3766 Header files should @emph{not} include @samp{<config.h>} and
3767 @samp{"lisp.h"}. It is the responsibility of the @file{.c} files that
3768 use them to do so.
3769
3770 @end itemize
3771
3772 @cindex Lisp object types, creating
3773 @cindex creating Lisp object types
3774 @cindex object types, creating Lisp
3775 Here is a checklist of things to do when creating a new lisp object type
3776 named @var{foo}:
3777
3778 @enumerate
3779 @item
3780 create @var{foo}.h
3781 @item
3782 create @var{foo}.c
3783 @item
3784 add definitions of @code{syms_of_@var{foo}}, etc. to @file{@var{foo}.c}
3785 @item
3786 add declarations of @code{syms_of_@var{foo}}, etc. to @file{symsinit.h}
3787 @item
3788 add calls to @code{syms_of_@var{foo}}, etc. to @file{emacs.c}
3789 @item
3790 add definitions of macros like @code{CHECK_@var{FOO}} and
3791 @code{@var{FOO}P} to @file{@var{foo}.h}
3792 @item
3793 add the new type index to @code{enum lrecord_type}
3794 @item
3795 add a DEFINE_LRECORD_IMPLEMENTATION call to @file{@var{foo}.c}
3796 @item
3797 add an INIT_LRECORD_IMPLEMENTATION call to @code{syms_of_@var{foo}} in @file{@var{foo}.c}
3798 @end enumerate
3799
3800 @node Regression Testing XEmacs, CVS Techniques, Rules When Writing New C Code, Top
3801 @chapter Regression Testing XEmacs
3802 @cindex testing, regression
3803
3804 @menu
3805 * How to Regression-Test::
3806 * Modules for Regression Testing::
3807 @end menu
3808
3809 @node How to Regression-Test, Modules for Regression Testing, Regression Testing XEmacs, Regression Testing XEmacs
3810 @section How to Regression-Test
3811 @cindex how to regression-test
3812 @cindex regression-test, how to
3813 @cindex testing, regression, how to
3814
3815 The source directory @file{tests/automated} contains XEmacs' automated
3816 test suite. The usual way of running all the tests is running
3817 @code{make check} from the top-level build directory.
3818
3819 The test suite is unfinished and it's still lacking some essential
3820 features. It is nevertheless recommended that you run the tests to
3821 confirm that XEmacs behaves correctly.
3822
3823 If you want to run a specific test case, you can do it from the
3824 command-line like this:
3825
3826 @example
3827 $ xemacs -batch -l test-harness.elc -f batch-test-emacs TEST-FILE
3828 @end example
3829
3830 If a test fails and you need more information, you can run the test
3831 suite interactively by loading @file{test-harness.el} into a running
3832 XEmacs and typing @kbd{M-x test-emacs-test-file RET <filename> RET}.
3833 You will see a log of passed and failed tests, which should allow you to
3834 investigate the source of the error and ultimately fix the bug. If you
3835 are not capable of, or don't have time for, debugging it yourself,
3836 please do report the failures using @kbd{M-x report-emacs-bug} or
3837 @kbd{M-x build-report}.
3838
3839 @deffn Command test-emacs-test-file file
3840 Runs the tests in @var{file}. @file{test-harness.el} must be loaded.
3841 Defines all the macros described in this node, and undefines them when
3842 done.
3843 @end deffn
3844
3845 Adding a new test file is trivial: just create a new file in @file{tests/automated} and it
3846 will be run. There is no need to byte-compile any of the files in
3847 this directory---the test-harness will take care of any necessary
3848 byte-compilation.
3849
3850 Look at the existing test cases for examples of how to code test cases.
3851 It all boils down to your imagination and judicious use of the macros
3852 @code{Assert}, @code{Check-Error}, @code{Check-Error-Message}, and
3853 @code{Check-Message}. Note that all of these macros are defined only
3854 for the duration of the test: they do not exist in the global
3855 environment.
3856
3857 @deffn Macro Assert expr
3858 Check that @var{expr} is non-nil at this point in the test.
3859 @end deffn
3860
3861 @deffn Macro Check-Error expected-error body
3862 Check that execution of @var{body} causes @var{expected-error} to be
3863 signaled. @var{body} is a @code{progn}-like body, and may contain
3864 several expressions. @var{expected-error} is a symbol defined as
3865 an error by @code{define-error}.
3866 @end deffn
3867
3868 @deffn Macro Check-Error-Message expected-error expected-error-regexp body
3869 Check that execution of @var{body} causes @var{expected-error} to be
3870 signaled, and generate a message matching @var{expected-error-regexp}.
3871 @var{body} is a @code{progn}-like body, and may contain several
3872 expressions. @var{expected-error} is a symbol defined as an error
3873 by @code{define-error}.
3874 @end deffn
3875
3876 @deffn Macro Check-Message expected-message body
3877 Check that execution of @var{body} causes @var{expected-message} to be
3878 generated (using @code{message} or a similar function). @var{body} is a
3879 @code{progn}-like body, and may contain several expressions.
3880 @end deffn
3881
3882 Here's a simple example checking case-sensitive and case-insensitive
3883 comparisons from @file{case-tests.el}.
3884
3885 @example
3886 (with-temp-buffer
3887 (insert "Test Buffer")
3888 (let ((case-fold-search t))
3889 (goto-char (point-min))
3890 (Assert (eq (search-forward "test buffer" nil t) 12))
3891 (goto-char (point-min))
3892 (Assert (eq (search-forward "Test buffer" nil t) 12))
3893 (goto-char (point-min))
3894 (Assert (eq (search-forward "Test Buffer" nil t) 12))
3895
3896 (setq case-fold-search nil)
3897 (goto-char (point-min))
3898 (Assert (not (search-forward "test buffer" nil t)))
3899 (goto-char (point-min))
3900 (Assert (not (search-forward "Test buffer" nil t)))
3901 (goto-char (point-min))
3902 (Assert (eq (search-forward "Test Buffer" nil t) 12))))
3903 @end example
3904
3905 This example could be saved in a file in @file{tests/automated}, and it
3906 would constitute a complete test, automatically executed when you run
3907 @kbd{make check} after building XEmacs. More complex tests may require
3908 substantial temporary scaffolding to create the environment that elicits
3909 the bugs, but the top-level @file{Makefile} and @file{test-harness.el}
3910 handle the running and collection of results from the @code{Assert},
3911 @code{Check-Error}, @code{Check-Error-Message}, and @code{Check-Message}
3912 macros.
3913
3914 Don't suppress tests just because they're due to known bugs not yet
3915 fixed---use the @code{Known-Bug-Expect-Failure} wrapper macro to mark
3916 them.
3917
3918 @deffn Macro Known-Bug-Expect-Failure body
3919 Arrange for failing tests in @var{body} to generate messages prefixed
3920 with "KNOWN BUG:" instead of "FAIL:". @var{body} is a @code{progn}-like
3921 body, and may contain several tests.
3922 @end deffn
3923
3924 A lot of the tests we run push limits; suppress Ebola warning messages
3925 with the @code{Ignore-Ebola} wrapper macro.
3926
3927 @deffn Macro Ignore-Ebola body
3928 Suppress Ebola warning messages while running tests in @var{body}.
3929 @var{body} is a @code{progn}-like body, and may contain several tests.
3930 @end deffn
3931
3932 Both macros are defined temporarily within the test function. Simple
3933 examples:
3934
3935 @example
3936 ;; Apparently Ignore-Ebola is a solution with no problem to address.
3937 ;; There are no examples in 21.5, anyway.
3938
3939 ;; from regexp-tests.el
3940 (Known-Bug-Expect-Failure
3941 (Assert (not (string-match "\\b" "")))
3942 (Assert (not (string-match " \\b" " "))))
3943 @end example
3944
3945 In general, you should avoid using functionality from packages in your
3946 tests, because you can't be sure that everyone will have the required
3947 package. However, if you've got a test that works, by all means add it.
3948 Simply wrap the test in an appropriate test, add a notice that the test
3949 was skipped, and update the @code{skipped-test-reasons} hashtable. The
3950 wrapper macro @code{Skip-Test-Unless} is provided to handle common
3951 cases.
3952
3953 @defvar skipped-test-reasons
3954 Hash table counting the number of times a particular reason is given for
3955 skipping tests. This is only defined within @code{test-emacs-test-file}.
3956 @end defvar
3957
3958 @deffn Macro Skip-Test-Unless prerequisite reason description body
3959 @var{prerequisite} is usually a feature test (@code{featurep},
3960 @code{boundp}, @code{fboundp}). @var{reason} is a string describing the
3961 prerequisite; it must be unique because it is used as a hash key in a
3962 table of reasons for skipping tests. @var{description} describes the
3963 tests being skipped, for the test result summary. @var{body} is a
3964 @code{progn}-like body, and may contain several tests.
3965 @end deffn
3966
3967 @code{Skip-Test-Unless} is defined temporarily within the test function.
3968 Here's an example of usage from @file{syntax-tests.el}:
3969
3970 @example
3971 ;; Test forward-comment at buffer boundaries
3972 (with-temp-buffer
3973 ;; try to use exactly what you need: featurep, boundp, fboundp
3974 (Skip-Test-Unless (fboundp 'c-mode)
3975 "c-mode unavailable"
3976 "comment and parse-partial-sexp tests"
3977 ;; and here's the test code
3978 (c-mode)
3979 (insert "// comment\n")
3980 (forward-comment -2)
3981 (Assert (eq (point) (point-min)))
3982 (let ((point (point)))
3983 (insert "/* comment */")
3984 (goto-char point)
3985 (forward-comment 2)
3986 (Assert (eq (point) (point-max)))
3987 (parse-partial-sexp point (point-max)))))
3988 @end example
3989
3990 @code{Skip-Test-Unless} is intended for use with features that are normally
3991 present in typical configurations. For truly optional features, or
3992 tests that apply to one of several alternative implementations (e.g., to
3993 GTK widgets, but not Athena, Motif, MS Windows, or Carbon), simply
3994 silently suppress the test if the feature is not available.
3995
3996 Here are a few general hints for writing tests.
3997
3998 @enumerate
3999 @item
4000 Include related successful cases. Fixes often break something.
4001
4002 @item
4003 Use the Known-Bug-Expect-Failure macro to mark the cases you know
4004 are going to fail. We want to be able to distinguish between
4005 regressions and other unexpected failures, and cases that have
4006 been (partially) analyzed but not yet repaired.
4007
4008 @item
4009 Mark the bug with the date of report. An ``Unfixed since yyyy-mm-dd''
4010 gloss for Known-Bug-Expect-Failure is planned to further increase
4011 developer embarrassment (== incentive to fix the bug), but until then at
4012 least put a comment about the date so we can easily see when it was
4013 first reported.
4014
4015 @item
4016 It's a matter of your judgement, but you should often use generic tests
4017 (@emph{e.g.}, @code{eq}) instead of more specific tests (@code{=} for
4018 numbers) even though you know that arguments ``should'' be of correct
4019 type. This applies when the functions used can return generic objects
4020 (typically @code{nil}), as well as some more specific type that will be
4021 returned on success. We don't want failures of those assertions
4022 reported as ``other failures'' (a wrong-type-arg signal, rather than a
4023 null return), we want them reported as ``assertion failures.''
4024
4025 One example is a test that tests @code{(= (string-match this that) 0)},
4026 expecting a successful match. Now suppose @code{string-match} is broken
4027 such that the match fails. Then it will return @code{nil}, and @code{=}
4028 will signal ``wrong-type-argument, number-char-or-marker-p, nil'',
4029 generating an ``other failure'' in the report. But this should be
4030 reported as an assertion failure (the test failed in a foreseeable way),
4031 rather than something else (we don't know what happened because XEmacs
4032 is broken in a way that we weren't trying to test!)
4033 @end enumerate
4034
4035 @node Modules for Regression Testing, , How to Regression-Test, Regression Testing XEmacs
4036 @section Modules for Regression Testing
4037 @cindex modules for regression testing
4038 @cindex regression testing, modules for
4039
4040 @example
4041 @file{test-harness.el}
4042 @file{base64-tests.el}
4043 @file{byte-compiler-tests.el}
4044 @file{case-tests.el}
4045 @file{ccl-tests.el}
4046 @file{c-tests.el}
4047 @file{database-tests.el}
4048 @file{extent-tests.el}
4049 @file{hash-table-tests.el}
4050 @file{lisp-tests.el}
4051 @file{md5-tests.el}
4052 @file{mule-tests.el}
4053 @file{regexp-tests.el}
4054 @file{symbol-tests.el}
4055 @file{syntax-tests.el}
4056 @file{tag-tests.el}
4057 @file{weak-tests.el}
4058 @end example
4059
4060 @file{test-harness.el} defines the macros @code{Assert},
4061 @code{Check-Error}, @code{Check-Error-Message}, and
4062 @code{Check-Message}. The other files are test files, testing various
4063 XEmacs facilities. @xref{Regression Testing XEmacs}.
4064
4065
4066 @node CVS Techniques, The Modules of XEmacs, Regression Testing XEmacs, Top
4067 @chapter CVS Techniques
4068 @cindex CVS techniques
4069
4070 @menu
4071 * Merging a Branch into the Trunk::
4072 @end menu
4073
4074 @node Merging a Branch into the Trunk, , CVS Techniques, CVS Techniques
4075 @section Merging a Branch into the Trunk
4076 @cindex merging a branch into the trunk
4077
4078 @enumerate
4079 @item
4080 If you haven't already done a merge, you will be merging from the branch
4081 point; otherwise you'll be merging from the last merge point, which
4082 should be marked by a tag, e.g. @samp{last-sync-ben-mule-21-5}. In the
4083 former case, create the last-sync tag, e.g.
4084
4085 @example
4086 crw rtag -r ben-mule-21-5-bp last-sync-ben-mule-21-5 xemacs
4087 @end example
4088
4089 (You did create a branch point tag when you created the branch, didn't
4090 you?)
4091
4092 @item
4093 Check everything in on your branch.
4094
4095 @item
4096 Tag your branch with a pre-sync tag, e.g.
4097
4098 @example
4099 crw rtag -r ben-mule-21-5 ben-mule-21-5-pre-feb-20-2002-sync xemacs
4100 @end example
4101
4102 Note, you need to use rtag and specify a version with @samp{-r} (use
4103 @samp{-r HEAD} if necessary) so that removed files are handled correctly
4104 in some obscure cases. See section 4.8 of the CVS manual.
4105
4106 @item
4107 Tag the trunk so you have a stable place to merge up to in case people
4108 are asynchronously committing to the trunk, e.g.
4109
4110 @example
4111 crw rtag -r HEAD main-branch-ben-mule-21-5-syncpoint-feb-20-2002 xemacs
4112 crw rtag -F -r main-branch-ben-mule-21-5-syncpoint-feb-20-2002 next-sync-ben-mule-21-5 xemacs
4113 @end example
4114
4115 Use -F in the second case because the name might already exist, e.g. if
4116 you've already done a merge. We make two tags because one is a
4117 permanent mark indicating a syncpoint when merging, and the other is a
4118 symbolic tag to make other operations easier.
4119
4120 @item
4121 Make a backup of your source tree (not totally necessary but useful for
4122 reference and peace of mind): Move one level up from the top directory
4123 of your branch and do, e.g.
4124
4125 @example
4126 cp -a mule mule-backup-2-23-02
4127 @end example
4128
4129 @item
4130 Now, we're ready to merge! Make sure you're in the top directory of
4131 your branch and do, e.g.
4132
4133 @example
4134 cvs update -j last-sync-ben-mule-21-5 -j next-sync-ben-mule-21-5
4135 @end example
4136
4137 @item
4138 Fix all merge conflicts. Get the sucker to compile and run.
4139
4140 @item
4141 Tag your branch with a post-sync tag, e.g.
4142
4143 @example
4144 crw rtag -r ben-mule-21-5 ben-mule-21-5-post-feb-20-2002-sync xemacs
4145 @end example
4146
4147 @item
4148 Update the last-sync tag, e.g.
4149
4150 @example
4151 crw rtag -F -r next-sync-ben-mule-21-5 last-sync-ben-mule-21-5 xemacs
4152 @end example
4153 @end enumerate
4154
4155
4156 @node The Modules of XEmacs, Allocation of Objects in XEmacs Lisp, CVS Techniques, Top
4157 @chapter The Modules of XEmacs 2060 @chapter The Modules of XEmacs
4158 @cindex modules of XEmacs 2061 @cindex modules of XEmacs
4159 2062
4160 @menu 2063 @menu
4161 * A Summary of the Various XEmacs Modules:: 2064 * A Summary of the Various XEmacs Modules::
5777 @end example 3680 @end example
5778 3681
5779 This module provides some terminal-control code necessary on versions of 3682 This module provides some terminal-control code necessary on versions of
5780 AIX prior to 4.1. 3683 AIX prior to 4.1.
5781 3684
5782 3685 @node Major Textual Changes, Rules When Writing New C Code, The Modules of XEmacs, Top
5783 @node Allocation of Objects in XEmacs Lisp, Dumping, The Modules of XEmacs, Top 3686 @chapter Major Textual Changes
3687 @cindex textual changes, major
3688 @cindex major textual changes
3689
3690 Sometimes major textual changes are made to the source. This means that
3691 a search-and-replace is done to change type names and such. Some people
3692 disagree with such changes, and certainly such changes, if done without
3693 good reason, will just lead to headaches. But it's important to keep the code clean
3694 and understandable, and consistent naming goes a long way towards this.
3695
3696 An example of the right way to do this was the so-called "great integral
3697 type renaming".
3698
3699 @menu
3700 * Great Integral Type Renaming::
3701 * Text/Char Type Renaming::
3702 @end menu
3703
3704 @node Great Integral Type Renaming, Text/Char Type Renaming, Major Textual Changes, Major Textual Changes
3705 @section Great Integral Type Renaming
3706 @cindex Great Integral Type Renaming
3707 @cindex integral type renaming, great
3708 @cindex type renaming, integral
3709 @cindex renaming, integral types
3710
3711 The purpose of this is to rationalize the names used for various
3712 integral types, so that they match their intended uses and follow
3713 consistent conventions, and eliminate types that were not semantically
3714 different from each other.
3715
3716 The conventions are:
3717
3718 @itemize @bullet
3719 @item
3720 All integral types that measure quantities of anything are signed. Some
3721 people disagree vociferously with this, but their arguments are mostly
3722 theoretical, and are vastly outweighed by the practical headaches of
3723 mixing signed and unsigned values, and more importantly by the far
3724 increased likelihood of inadvertent bugs: Because of the broken "viral"
3725 nature of unsigned quantities in C (operations involving mixed
3726 signed/unsigned are done unsigned, when exactly the opposite is nearly
3727 always wanted), even a single error in declaring a quantity unsigned
3728 that should be signed, or even the even more subtle error of comparing
3729 signed and unsigned values and forgetting the necessary cast, can be
3730 catastrophic, as comparisons will yield wrong results. -Wsign-compare
3731 is turned on specifically to catch this, but this tends to result in a
3732 great number of warnings when mixing signed and unsigned, and the casts
3733 are annoying. More has been written on this elsewhere.
3734
3735 @item
3736 All such quantity types just mentioned boil down to EMACS_INT, which is
3737 32 bits on 32-bit machines and 64 bits on 64-bit machines. This is
3738 guaranteed to be the same size as Lisp objects of type @code{int}, and (as
3739 far as I can tell) of size_t (unsigned!) and ssize_t. The only type
3740 below that is not an EMACS_INT is Hashcode, which is an unsigned value
3741 of the same size as EMACS_INT.
3742
3743 @item
3744 Type names should be relatively short (no more than 10 characters or
3745 so), with the first letter capitalized and no underscores if they can at
3746 all be avoided.
3747
3748 @item
3749 "count" == a zero-based measurement of some quantity. Includes sizes,
3750 offsets, and indexes.
3751
3752 @item
3753 "bpos" == a one-based measurement of a position in a buffer. "Charbpos"
3754 and "Bytebpos" count text in the buffer, rather than bytes in memory;
3755 thus Bytebpos does not directly correspond to the memory representation.
3756 Use "Membpos" for this.
3757
3758 @item
3759 "Char" refers to internal-format characters, not to the C type "char",
3760 which is really a byte.
3761 @end itemize
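
The hazard described in the first convention can be reproduced in a few
lines of C. The following is a minimal sketch, not XEmacs code: the
function names are made up for illustration, and @code{long} stands in
for @code{EMACS_INT}.

```c
/* Illustration only: these names do not exist in XEmacs. */

/* Mixed comparison: `a' is converted to unsigned before comparing,
   so a negative `a' turns into a very large positive number. */
static int
mixed_less (int a, unsigned int b)
{
  return a < b;
}

/* All-signed comparison: behaves the way arithmetic intuition expects. */
static int
signed_less (long a, long b)
{
  return a < b;
}
```

With these definitions, @code{mixed_less (-1, 1)} is false while
@code{signed_less (-1, 1)} is true. Compiling with @samp{-Wsign-compare}
flags the comparison in the first function.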
3762
3763 For the actual name changes, see the script below.
3764
3765 I ran the following script to do the conversion. (NOTE: This script is
3766 idempotent. You can safely run it multiple times and it will not screw
3767 up previous results -- in fact, it will do nothing if nothing has
3768 changed. Thus, it can be run repeatedly as necessary to handle patches
3769 coming in from old workspaces, or old branches.) There are two tags,
3770 just before and just after the change: @samp{pre-integral-type-rename}
3771 and @samp{post-integral-type-rename}. When merging code from the main
3772 trunk into a branch, the best thing to do is first merge up to
3773 @samp{pre-integral-type-rename}, then apply the script and associated
3774 changes, then merge from @samp{post-integral-type-rename} to the
3775 present. (Alternatively, just do the merging in one operation; but you
3776 may then have a lot of conflicts needing to be resolved by hand.)
3777
3778 Script @samp{fixtypes.sh} follows:
3779
3780 @example
3781 ----------------------------------- cut ------------------------------------
3782 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
3783 gr Memory_Count Bytecount $files
3784 gr Lstream_Data_Count Bytecount $files
3785 gr Element_Count Elemcount $files
3786 gr Hash_Code Hashcode $files
3787 gr extcount bytecount $files
3788 gr bufpos charbpos $files
3789 gr bytind bytebpos $files
3790 gr memind membpos $files
3791 gr bufbyte intbyte $files
3792 gr Extcount Bytecount $files
3793 gr Bufpos Charbpos $files
3794 gr Bytind Bytebpos $files
3795 gr Memind Membpos $files
3796 gr Bufbyte Intbyte $files
3797 gr EXTCOUNT BYTECOUNT $files
3798 gr BUFPOS CHARBPOS $files
3799 gr BYTIND BYTEBPOS $files
3800 gr MEMIND MEMBPOS $files
3801 gr BUFBYTE INTBYTE $files
3802 gr MEMORY_COUNT BYTECOUNT $files
3803 gr LSTREAM_DATA_COUNT BYTECOUNT $files
3804 gr ELEMENT_COUNT ELEMCOUNT $files
3805 gr HASH_CODE HASHCODE $files
3806 ----------------------------------- cut ------------------------------------
3807 @end example
3808
3809 The @samp{gr} script, and the scripts it uses, are documented in
3810 @file{README.global-renaming}, because if placed in this file they would
3811 need to have their @@ characters doubled, meaning you couldn't easily
3812 cut and paste from the source.
3813
3814 In addition to those programs, I needed to fix up a few other
3815 things, particularly relating to the duplicate definitions of
3816 types, now that some types merged with others. Specifically:
3817
3818 @enumerate
3819 @item
3820 in @file{lisp.h}, removed duplicate declarations of Bytecount. The changed
3821 code should now look like this: (In each code snippet below, the first
3822 and last lines are the same as the original, as are all lines outside of
3823 those lines. That allows you to locate the section to be replaced, and
3824 replace the stuff in that section, verifying that there isn't anything
3825 new added that would need to be kept.)
3826
3827 @example
3828 --------------------------------- snip -------------------------------------
3829 /* Counts of bytes or chars */
3830 typedef EMACS_INT Bytecount;
3831 typedef EMACS_INT Charcount;
3832
3833 /* Counts of elements */
3834 typedef EMACS_INT Elemcount;
3835
3836 /* Hash codes */
3837 typedef unsigned long Hashcode;
3838
3839 /* ------------------------ dynamic arrays ------------------- */
3840 --------------------------------- snip -------------------------------------
3841 @end example
3842
3843 @item
3844 in @file{lstream.h}, removed duplicate declaration of Bytecount. Rewrote the
3845 comment about this type. The changed code should now look like this:
3846
3847 @example
3848 --------------------------------- snip -------------------------------------
3849 #endif
3850
3851 /* The have been some arguments over the what the type should be that
3852 specifies a count of bytes in a data block to be written out or read in,
3853 using @code{Lstream_read()}, @code{Lstream_write()}, and related functions.
3854 Originally it was long, which worked fine; Martin "corrected" these to
3855 size_t and ssize_t on the grounds that this is theoretically cleaner and
3856 is in keeping with the C standards. Unfortunately, this practice is
3857 horribly error-prone due to design flaws in the way that mixed
3858 signed/unsigned arithmetic happens. In fact, by doing this change,
3859 Martin introduced a subtle but fatal error that caused the operation of
3860 sending large mail messages to the SMTP server under Windows to fail.
3861 By putting all values back to be signed, avoiding any signed/unsigned
3862 mixing, the bug immediately went away. The type then in use was
3863 Lstream_Data_Count, so that it be reverted cleanly if a vote came to
3864 that. Now it is Bytecount.
3865
3866 Some earlier comments about why the type must be signed: This MUST BE
3867 SIGNED, since it also is used in functions that return the number of
3868 bytes actually read to or written from in an operation, and these
3869 functions can return -1 to signal error.
3870
3871 Note that the standard Unix @code{read()} and @code{write()} functions define the
3872 count going in as a size_t, which is UNSIGNED, and the count going
3873 out as an ssize_t, which is SIGNED. This is a horrible design
3874 flaw. Not only is it highly likely to lead to logic errors when a
3875 -1 gets interpreted as a large positive number, but operations are
3876 bound to fail in all sorts of horrible ways when a number in the
3877 upper-half of the size_t range is passed in -- this number is
3878 unrepresentable as an ssize_t, so code that checks to see how many
3879 bytes are actually written (which is mandatory if you are dealing
3880 with certain types of devices) will get completely screwed up.
3881
3882 --ben
3883 */
3884
3885 typedef enum lstream_buffering
3886 --------------------------------- snip -------------------------------------
3887 @end example
3888
3889 @item
3890 in @file{dumper.c}, there are four places, all inside of @code{switch()} statements,
3891 where XD_BYTECOUNT appears twice as a case tag. In each case, the two
3892 case blocks contain identical code, and you should *REMOVE THE SECOND*
3893 and leave the first.
3894 @end enumerate
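
The argument in the @file{lstream.h} comment for keeping byte counts
signed can be sketched as follows. This is an illustration under stated
assumptions, not the Lstream code: @code{demo_read} and the other names
are hypothetical, and @code{long} stands in for the @code{EMACS_INT}
behind @code{Bytecount}.

```c
#include <stddef.h>

/* Illustration only.  `long' stands in for the EMACS_INT behind
   Bytecount, and demo_read() is a made-up function. */
typedef long Demo_Bytecount;

/* Pretend reader: returns a byte count, or -1 to signal an error. */
static Demo_Bytecount
demo_read (int fail)
{
  return fail ? -1 : 10;
}

/* Result kept in an unsigned type: -1 wraps to SIZE_MAX, so an
   error is indistinguishable from a huge successful read. */
static int
read_succeeded_unsigned (int fail)
{
  size_t n = (size_t) demo_read (fail);
  return n > 0;
}

/* Result kept signed: the error value survives and is detected. */
static int
read_succeeded_signed (int fail)
{
  Demo_Bytecount n = demo_read (fail);
  return n > 0;
}
```

When the read fails, the unsigned variant still reports success, while
the signed variant reports the failure.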
3895
3896 @node Text/Char Type Renaming, , Great Integral Type Renaming, Major Textual Changes
3897 @section Text/Char Type Renaming
3898 @cindex Text/Char Type Renaming
3899 @cindex type renaming, text/char
3900 @cindex renaming, text/char types
3901
3902 The purpose of this was
3903
3904 @enumerate
3905 @item
3906 To distinguish between ``charptr'' when it refers to operations on
3907 the pointer itself and when it refers to operations on text
3908 @item
3909 To use consistent naming for everything referring to internal format, i.e.
3910 @end enumerate
3911
3912 @example
3913 Itext == text in internal format
3914 Ibyte == a byte in such text
3915 Ichar == a char as represented in internal character format
3916 @end example
3917
3918 Thus e.g.
3919
3920 @example
3921 set_charptr_emchar -> set_itext_ichar
3922 @end example
3923
3924 This was done using a script like this:
3925
3926 @example
3927 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
3928 gr Intbyte Ibyte $files
3929 gr INTBYTE IBYTE $files
3930 gr intbyte ibyte $files
3931 gr EMCHAR ICHAR $files
3932 gr emchar ichar $files
3933 gr Emchar Ichar $files
3934 gr INC_CHARPTR INC_IBYTEPTR $files
3935 gr DEC_CHARPTR DEC_IBYTEPTR $files
3936 gr VALIDATE_CHARPTR VALIDATE_IBYTEPTR $files
3937 gr valid_charptr valid_ibyteptr $files
3938 gr CHARPTR ITEXT $files
3939 gr charptr itext $files
3940 gr Charptr Itext $files
3941 @end example
3942
3943 See above for the source to @samp{gr}.
3944
3945 As in the integral-types change, there are pre and post tags before and
3946 after the change:
3947
3948 @example
3949 pre-internal-format-textual-renaming
3950 post-internal-format-textual-renaming
3951 @end example
3952
3953 When merging a large branch, follow the same sort of procedure
3954 documented above, using these tags -- essentially sync up to the pre
3955 tag, then apply the script yourself, then sync from the post tag to the
3956 present. You can probably do the same if you don't have a separate
3957 workspace, but do have lots of outstanding changes and you'd rather not
3958 just merge all the textual changes directly. Use something like this:
3959
3960 (WARNING: I'm not a CVS guru; before trying this, or any large operation
3961 that might potentially mess things up, @strong{DEFINITELY} make a backup of
3962 your existing workspace.)
3963
3964 @example
3965 cup -r pre-internal-format-textual-renaming
3966 <apply script>
3967 cup -A -j post-internal-format-textual-renaming -j HEAD
3968 @end example
3969
3970 This might also work:
3971
3972 @example
3973 cup -j pre-internal-format-textual-renaming
3974 <apply script>
3975 cup -j post-internal-format-textual-renaming -j HEAD
3976 @end example
3977
3978 ben
3979
3980 The following is a script to go in the opposite direction:
3981
3982 @example
3983 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
3984
3985 # Evidently Perl considers _ to be a word char ala \b, even though XEmacs
3986 # doesn't. We need to be careful here with ibyte/ichar because of words
3987 # like Richard, @code{eicharlen()}, multibyte, HIBYTE, etc.
3988
3989 gr Ibyte Intbyte $files
3990 gr '\bIBYTE' INTBYTE $files
3991 gr '\bibyte' intbyte $files
3992 gr '\bICHAR' EMCHAR $files
3993 gr '\bichar' emchar $files
3994 gr '\bIchar' Emchar $files
3995 gr '\bIBYTEPTR' CHARPTR $files
3996 gr '\bibyteptr' charptr $files
3997 gr '\bITEXT' CHARPTR $files
3998 gr '\bitext' charptr $files
3999 gr '\bItext' CHARPTR $files
4000
4001 gr '_IBYTE' _INTBYTE $files
4002 gr '_ibyte' _intbyte $files
4003 gr '_ICHAR' _EMCHAR $files
4004 gr '_ichar' _emchar $files
4005 gr '_Ichar' _Emchar $files
4006 gr '_IBYTEPTR' _CHARPTR $files
4007 gr '_ibyteptr' _charptr $files
4008 gr '_ITEXT' _CHARPTR $files
4009 gr '_itext' _charptr $files
4010 gr '_Itext' _CHARPTR $files
4011 @end example
4012
4013 @node Rules When Writing New C Code, Regression Testing XEmacs, Major Textual Changes, Top
4014 @chapter Rules When Writing New C Code
4015 @cindex writing new C code, rules when
4016 @cindex C code, rules when writing new
4017 @cindex code, rules when writing new C
4018
4019 The XEmacs C Code is extremely complex and intricate, and there are many
4020 rules that are more or less consistently followed throughout the code.
4021 Many of these rules are not obvious, so they are explained here. It is
4022 of the utmost importance that you follow them. If you don't, you may
4023 get something that appears to work, but which will crash in odd
4024 situations, often in code far away from where the actual breakage is.
4025
4026 @menu
4027 * A Reader's Guide to XEmacs Coding Conventions::
4028 * General Coding Rules::
4029 * Object-Oriented Techniques for C::
4030 * Writing Lisp Primitives::
4031 * Writing Good Comments::
4032 * Adding Global Lisp Variables::
4033 * Writing Macros::
4034 * Proper Use of Unsigned Types::
4035 * Techniques for XEmacs Developers::
4036 @end menu
4037
4038 See also @ref{Coding for Mule}.
4039
4040 @node A Reader's Guide to XEmacs Coding Conventions, General Coding Rules, Rules When Writing New C Code, Rules When Writing New C Code
4041 @section A Reader's Guide to XEmacs Coding Conventions
4042 @cindex coding conventions
4043 @cindex reader's guide
4044 @cindex coding rules, naming
4045
4046 Of course the low-level implementation language of XEmacs is C, but much
4047 of that uses the Lisp engine to do its work. However, because the code
4048 is ``inside'' of the protective containment shell around the ``reactor
4049 core,'' you'll see lots of complex ``plumbing'' needed to do the work
4050 and ``safety mechanisms,'' whose failure results in a meltdown. This
4051 section provides a quick overview (or review) of the various components
4052 of the implementation of Lisp objects.
4053
4054 Two typographic conventions help to identify C objects that implement
4055 Lisp objects. The first is that capitalized identifiers, especially
4056 beginning with the letters @samp{Q}, @samp{V}, @samp{F}, and @samp{S},
4057 for C variables and functions, and C macros beginning with the
4058 letter @samp{X}, are used to implement Lisp. The second is that where
4059 Lisp uses the hyphen @samp{-} in symbol names, the corresponding C
4060 identifiers use the underscore @samp{_}. Of course, since XEmacs Lisp
4061 contains interfaces to many external libraries, those external names
4062 will follow the coding conventions their authors chose, and may overlap
4063 the ``XEmacs name space.'' However these cases are usually pretty
4064 obvious.
4065
4066 All Lisp objects are handled indirectly. The @code{Lisp_Object}
4067 type is usually a pointer to a structure, except for a very small number
4068 of types with immediate representations (currently characters and
4069 integers). However, these types cannot be directly operated on in C
4070 code, either, so they can also be considered indirect. Types that do
4071 not have an immediate representation always have a C typedef
4072 @code{Lisp_@var{type}} for a corresponding structure.
4073 @c #### mention l(c)records here?
4074
4075 In older code, it was common practice to pass around pointers to
4076 @code{Lisp_@var{type}}, but this is now deprecated in favor of using
4077 @code{Lisp_Object} for all function arguments and return values that are
4078 Lisp objects. The @code{X@var{type}} macro is used to extract the
4079 pointer and cast it to @code{(Lisp_@var{type} *)} for the desired type.
4080
4081 @strong{Convention}: macros whose names begin with @samp{X} operate on
4082 @code{Lisp_Object}s and do no type-checking. Many such macros are type
4083 extractors, but others implement Lisp operations in C (@emph{e.g.},
4084 @code{XCAR} implements the Lisp @code{car} function). These are unsafe,
4085 and must only be used where types of all data have already been checked.
4086 Such macros are only applied to @code{Lisp_Object}s. In internal
4087 implementations where the pointer has already been converted, the
4088 structure is operated on directly using the C @code{->} member access
4089 operator.
4090
4091 The @code{@var{type}P}, @code{CHECK_@var{type}}, and
4092 @code{CONCHECK_@var{type}} macros are used to test types. The first
4093 returns a Boolean value, and the latter two signal errors. (The
4094 @samp{CONCHECK} variety allows execution to be CONtinued under some
4095 circumstances, thus the name.) Functions which expect to be passed user
4096 data invariably call @samp{CHECK} macros on arguments.
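
The division of labor between predicates, checks, and unchecked
extractors can be shown with a toy model. Everything below is an
assumption made for illustration: the @samp{TOY_} names are invented,
and the real @code{Lisp_Object}, @code{XCONS}, @code{CONSP}, and
@code{CHECK_CONS} in @file{lisp.h} are considerably more involved.

```c
#include <stddef.h>

/* Toy model only: the real Lisp_Object, XCONS, CONSP and CHECK_CONS
   live in lisp.h and are considerably more involved. */
enum toy_type { TOY_CONS, TOY_STRING };

struct toy_header { enum toy_type type; };
typedef struct toy_header *Toy_Object;      /* stand-in for Lisp_Object */

struct Toy_Cons
{
  struct toy_header header;                 /* must be the first member */
  Toy_Object car, cdr;
};

/* Predicate in the style of <TYPE>P: returns a C int. */
#define TOY_CONSP(obj) ((obj)->type == TOY_CONS)

/* Extractors in the style of X<TYPE>: no type checking at all. */
#define TOY_XCONS(obj) ((struct Toy_Cons *) (obj))
#define TOY_XCAR(obj) (TOY_XCONS (obj)->car)

/* Checked accessor.  The real CHECK_ macros signal a Lisp error;
   this sketch just returns NULL for a wrong-type argument. */
static Toy_Object
toy_car (Toy_Object obj)
{
  if (!TOY_CONSP (obj))
    return NULL;
  return TOY_XCAR (obj);
}
```

The point of the sketch is the ordering: the type test happens once, at
the boundary where untrusted data arrives, and the unchecked
@samp{X}-style extractor is used only after that test has passed.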
4097
4098 There are many types of specialized Lisp objects implemented in C, but
4099 the most pervasive type is the @dfn{symbol}. Symbols are used as
4100 identifiers, variables, and functions.
4101
4102 @strong{Convention}: Global variables whose names begin with @samp{Q}
4103 are constants whose value is a symbol. The name of the variable should
4104 be derived from the name of the symbol using the same rules as for Lisp
4105 primitives. Such variables allow the C code to check whether a
4106 particular @code{Lisp_Object} is equal to a given symbol. Symbols are
4107 Lisp objects, so these variables may be passed to Lisp primitives. (An
4108 alternative to the use of @samp{Q...} variables is to call the
4109 @code{intern} function at initialization in the
4110 @code{vars_of_@var{module}} function, which is hardly less efficient.)
4111
4112 @strong{Convention}: Global variables whose names begin with @samp{V}
4113 are variables that contain Lisp objects. The convention here is that
4114 all global variables of type @code{Lisp_Object} begin with @samp{V}, and
4115 no others do (not even integer and boolean variables that have Lisp
4116 equivalents). Most of the time, these variables have equivalents in
4117 Lisp, which are defined via the @samp{DEFVAR} family of macros, but some
4118 don't. Since the variable's value is a @code{Lisp_Object}, it can be
4119 passed to Lisp primitives.
4120
4121 The implementation of Lisp primitives is more complex.
4122 @strong{Convention}: Global variables with names beginning with @samp{S}
4123 contain a structure that allows the Lisp engine to identify and call a C
4124 function. In modern versions of XEmacs, these identifiers are almost
4125 always completely hidden in the @code{DEFUN} and @code{SUBR} macros, but
4126 you will encounter them if you look at very old versions of XEmacs or at
4127 GNU Emacs. @strong{Convention}: Functions with names beginning with
4128 @samp{F} implement Lisp primitives. Of course all their arguments and
4129 their return values must be @code{Lisp_Object}s. (This is hidden in the
4130 @code{DEFUN} macro.)
4131
4132
4133 @node General Coding Rules, Object-Oriented Techniques for C, A Reader's Guide to XEmacs Coding Conventions, Rules When Writing New C Code
4134 @section General Coding Rules
4135 @cindex coding rules, general
4136
4137 The C code is actually written in a dialect of C called @dfn{Clean C},
4138 meaning that it can be compiled, mostly warning-free, with either a C or
4139 C++ compiler. Coding in Clean C has several advantages over plain C.
4140 C++ compilers are more nit-picking, and a number of coding errors have
4141 been found by compiling with C++. The ability to use both C and C++
4142 tools means that a greater variety of development tools are available to
4143 the developer. In addition, the ability to overload operators in C++
4144 means it is possible, for error-checking purposes, to redefine certain
4145 simple types (normally defined as aliases for simple built-in types such
4146 as @code{unsigned char} or @code{long}) as classes, strictly limiting the permissible
4147 operations and catching illegal implicit casts and such.
4148
4149 Every module includes @file{<config.h>} (angle brackets so that
4150 @samp{--srcdir} works correctly; @file{config.h} may or may not be in
4151 the same directory as the C sources) and @file{lisp.h}. @file{config.h}
4152 must always be included before any other header files (including
4153 system header files) to ensure that certain tricks played by various
4154 @file{s/} and @file{m/} files work out correctly.
4155
4156 When including header files, always use angle brackets, not double
4157 quotes, except when the file to be included is always in the same
4158 directory as the including file. If either file is a generated file,
4159 then that is not likely to be the case. In order to understand why we
4160 have this rule, imagine what happens when you do a build in the source
4161 directory using @samp{./configure} and another build in another
4162 directory using @samp{../work/configure}. There will be two different
4163 @file{config.h} files. Which one will be used if you @samp{#include
4164 "config.h"}?
4165
4166 Almost every module contains a @code{syms_of_*()} function and a
4167 @code{vars_of_*()} function. The former declares any Lisp primitives
4168 you have defined and defines any symbols you will be using. The latter
4169 declares any global Lisp variables you have added and initializes global
4170 C variables in the module. @strong{Important}: There are stringent
4171 requirements on exactly what can go into these functions. See the
4172 comment in @file{emacs.c}. The reason for this is to avoid obscure
4173 unwanted interactions during initialization. If you don't follow these
4174 rules, you'll be sorry! If you want to do anything that isn't allowed,
4175 create a @code{complex_vars_of_*()} function for it. Doing this is
4176 tricky, though: you have to make sure your function is called at the
4177 right time so that all the initialization dependencies work out.
4178
4179 Declare each function of these kinds in @file{symsinit.h}. Make sure
4180 it's called in the appropriate place in @file{emacs.c}. You never need
4181 to include @file{symsinit.h} directly, because it is included by
4182 @file{lisp.h}.
4183
4184 @strong{All global and static variables that are to be modifiable must
4185 be declared uninitialized.} This means that you may not use the
4186 ``declare with initializer'' form for these variables, such as @code{int
4187 some_variable = 0;}. The reason for this has to do with some kludges
4188 done during the dumping process: If possible, the initialized data
4189 segment is re-mapped so that it becomes part of the (unmodifiable) code
4190 segment in the dumped executable. This allows this memory to be shared
4191 among multiple running XEmacs processes. XEmacs is careful to place as
4192 much constant data as possible into initialized variables during the
4193 @file{temacs} phase.
4194
4195 @cindex copy-on-write
4196 @strong{Please note:} This kludge only works on a few systems nowadays,
4197 and is rapidly becoming irrelevant because most modern operating systems
4198 provide @dfn{copy-on-write} semantics. All data is initially shared
4199 between processes, and a private copy is automatically made (on a
4200 page-by-page basis) when a process first attempts to write to a page of
4201 memory.
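
The uninitialized-declaration rule above amounts to moving the
initializer out of the declaration and into a startup function. A
minimal sketch, assuming a hypothetical module named @samp{demo} (none
of these names exist in XEmacs):

```c
/* Hypothetical module "demo"; none of these names exist in XEmacs. */

/* Wrong under the rule above:  static int demo_cache_size = 64;  */
static int demo_cache_size;        /* declared without an initializer */

/* Initialization happens at startup instead.  In XEmacs the function
   would be declared in symsinit.h and called from emacs.c. */
void
vars_of_demo (void)
{
  demo_cache_size = 64;
}

int
demo_get_cache_size (void)
{
  return demo_cache_size;
}
```

Until @code{vars_of_demo} runs, the variable holds the zero value that C
gives to all objects with static storage duration.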
4202
4203 Formerly, there was a requirement that static variables not be declared
4204 inside of functions. This had to do with another hack along the same
4205 vein as what was just described: old USG systems put statically-declared
4206 variables in the initialized data space, so those header files had a
4207 @code{#define static} declaration. (That way, the data-segment remapping
4208 described above could still work.) This fails badly on static variables
4209 inside of functions, which suddenly become automatic variables;
4210 therefore, you weren't supposed to have any of them. This awful kludge
4211 has been removed in XEmacs because
4212
4213 @enumerate
4214 @item
4215 almost all of the systems that used this kludge ended up having
4216 to disable the data-segment remapping anyway;
4217 @item
4218 the only systems that didn't were extremely outdated ones;
4219 @item
4220 this hack completely messed up inline functions.
4221 @end enumerate
4222
4223 The C source code makes heavy use of C preprocessor macros. One popular
4224 macro style is:
4225
4226 @example
4227 #define FOO(var, value) do @{ \
4228 Lisp_Object FOO_value = (value); \
4229 ... /* compute using FOO_value */ \
4230 (var) = bar; \
4231 @} while (0)
4232 @end example
4233
4234 The @code{do @{...@} while (0)} is a standard trick to allow FOO to have
4235 statement semantics, so that it can safely be used within an @code{if}
4236 statement in C, for example. Multiple evaluation is prevented by
4237 copying a supplied argument into a local variable, so that
4238 @code{FOO(var,fun(1))} only calls @code{fun} once.
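
Both properties can be checked with a small self-contained macro written
in the same style. This is a sketch with made-up names
(@code{STORE_TWICE}, @code{next_value}), not a macro from the XEmacs
sources.

```c
/* Illustration only: STORE_TWICE and next_value are made-up names. */
static int call_count;

/* Side-effecting function, used to detect multiple evaluation. */
static int
next_value (void)
{
  return ++call_count;
}

/* Same shape as FOO above: the argument is copied into a local once,
   and do { ... } while (0) makes the whole thing a single statement. */
#define STORE_TWICE(var, value) do {                            \
  int store_twice_value = (value);                              \
  (var) = store_twice_value + store_twice_value;                \
} while (0)

static int
demo (int flag)
{
  int result = 0;
  if (flag)
    STORE_TWICE (result, next_value ());   /* safe before `else' */
  else
    result = -1;
  return result;
}
```

Without the @code{do @{...@} while (0)} wrapper, the semicolon after the
macro call would end the @code{if} statement and the @code{else} would
fail to compile. Without the local copy, @code{next_value} would be
called twice.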
4239
4240 Lisp lists are popular data structures in the C code as well as in
4241 Elisp. There are two sets of macros that iterate over lists.
4242 @code{EXTERNAL_LIST_LOOP_@var{n}} should be used when the list has been
4243 supplied by the user, and cannot be trusted to be acyclic and
4244 @code{nil}-terminated. A @code{malformed-list} or @code{circular-list} error
4245 will be generated if the list being iterated over is not entirely
4246 kosher. @code{LIST_LOOP_@var{n}}, on the other hand, is faster and less
4247 safe, and can be used only on trusted lists.
4248
4249 Related macros are @code{GET_EXTERNAL_LIST_LENGTH} and
4250 @code{GET_LIST_LENGTH}, which calculate the length of a list, and in the
4251 case of @code{GET_EXTERNAL_LIST_LENGTH}, validating the properness of
4252 the list. The macros @code{EXTERNAL_LIST_LOOP_DELETE_IF} and
4253 @code{LIST_LOOP_DELETE_IF} delete elements from a lisp list satisfying some
4254 predicate.
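
The difference between the trusted and untrusted variants can be
illustrated on a toy cons list. The sketch below is an assumption-laden
illustration, not the XEmacs implementation: the @samp{toy_} names are
invented, circularity is detected with the classic two-pointer
(tortoise and hare) technique, which may or may not be what the real
macros do, and a return value of -1 stands in for signalling a
@code{circular-list} error.

```c
#include <stddef.h>

/* Toy cons cell; not the XEmacs representation. */
struct toy_cons
{
  int car;
  struct toy_cons *cdr;
};

/* Trusted variant: fast, but loops forever on a circular list. */
static long
toy_list_length (const struct toy_cons *list)
{
  long n = 0;
  for (; list; list = list->cdr)
    n++;
  return n;
}

/* Untrusted variant: `fast' advances two cells for each cell that
   `slow' advances; if they ever meet, the list is circular. */
static long
toy_external_list_length (const struct toy_cons *list)
{
  const struct toy_cons *slow = list, *fast = list;
  long n = 0;

  while (fast)
    {
      fast = fast->cdr;
      n++;
      if (!fast)
        break;
      fast = fast->cdr;
      n++;
      slow = slow->cdr;
      if (fast == slow)
        return -1;                 /* stand-in for a circular-list error */
    }
  return n;
}
```

The extra pointer and comparison are the price of safety, which is why
the cheaper loop is reserved for lists the C code built itself.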
4255
4256 @node Object-Oriented Techniques for C, Writing Lisp Primitives, General Coding Rules, Rules When Writing New C Code
4257 @section Object-Oriented Techniques for C
4258 @cindex coding rules, object-oriented
4259 @cindex object-oriented techniques
4260
4261 At the lowest levels, XEmacs makes heavy use of object-oriented
4262 techniques to promote code-sharing and uniform interfaces for different
4263 devices and platforms. Commonly, but not always, such objects are
4264 ``wrapped'' and exported to Lisp as Lisp objects. Usually they use
4265 the internal structures developed for Lisp objects (the @samp{lrecord}
4266 structure) in order to take advantage of Lisp memory management.
4267 Unfortunately, XEmacs was originally written in C, so these techniques
4268 are based on heavy use of C macros.
4269
4270 @c You can't use @var{} for type below, because case is important.
4271 A module defining a class is likely to use most of the following
4272 declarations and macros. In the following, the notation @samp{<type>}
4273 will stand for the full name of the class, and will be capitalized in
4274 the way normal for its context. The notation @samp{<typ>} will stand
4275 for the abbreviated form commonly used in macro names, while @samp{ty}
4276 will be used as the typical name for instances of the class. (See the
4277 entry for @samp{MAYBE_<TYP>METH} below for an example using all three
4278 notations.)
4279
4280 In the interface (@file{.h} file), the following declarations are used
4281 often. Others may be used in particular modules. Since they're
4282 quite short in most cases, the definitions are given as well. The
4283 generic macros used are defined in @file{lisp.h} or @file{lrecord.h}.
4284
4285 @c #### reorganize this table into stuff used in general code, and stuff
4286 @c used only in declarations or initializations
4287 @table @samp
4288 @c #### declaration
4289 @item typedef struct Lisp_<Type> Lisp_<Type>
4290 This refers to the internal structure used by C code. The XEmacs coding
4291 style now forbids passing pointers to @samp{Lisp_<Type>} structures into
4292 or out of a function; instead, a @samp{Lisp_Object} should be passed or
4293 returned (created using @samp{wrap_<type>}, if necessary).
4294
4295 @c #### declaration
4296 @item DECLARE_LRECORD (<type>, Lisp_<Type>)
4297 Declares an @samp{lrecord} for @samp{<Type>}, which is the unit of
4298 allocation.
4299
4300 @item #define X<TYPE>(x) XRECORD (x, <type>, Lisp_<Type>)
4301 Turns a @code{Lisp_Object} into a pointer to @samp{struct Lisp_<Type>}.
4302
4303 @item #define wrap_<type>(p) wrap_record (p, <type>)
4304 Turns a pointer to @samp{struct Lisp_<Type>} into a @code{Lisp_Object}.
4305
4306 @item #define <TYPE>P(x) RECORDP (x, <type>)
4307 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>}.
4308 Returns a C int, not a Lisp Boolean value.
4309
4310 @item #define CHECK_<TYPE>(x) CHECK_RECORD (x, <type>)
4311 @itemx #define CONCHECK_<TYPE>(x) CONCHECK_RECORD (x, <type>)
4312 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>},
4313 and signals a Lisp error if not. The @samp{CHECK} version of the macro
4314 never returns if the type is wrong, while the @samp{CONCHECK} version
4315 can return if the user catches it in the debugger and explicitly
4316 requests a return.
4317
4318 @item #define RAW_<TYP>METH(ty, m) ((ty)->methods->m##_method)
4319 Return a function pointer for the method for an object @var{TY} of class
4320 @samp{Lisp_<Type>}, or @samp{NULL} if there is none for this type.
4321
4322 @item #define HAS_<TYP>METH_P(ty, m) (!!RAW_<TYP>METH (ty, m))
4323 Test whether the class that @var{TY} is an instance of has the method.
4324
4325 @item #define <TYP>METH(ty, m, args) ((RAW_<TYP>METH (ty, m)) args)
4326 Call the method on @samp{args}. @samp{args} must be enclosed in
4327 parentheses in the call. It is the programmer's responsibility to
4328 ensure that the method is available. The standard convenience macro
4329 @samp{MAYBE_<TYP>METH} is often provided for the common case where a
4330 void-returning method of @samp{<Type>} is called.
4331
4332 @item #define MAYBE_<TYP>METH(ty, m, args) do @{ ... @} while (0)
4333 Call a void-returning @samp{<Type>} method, if it exists. Note the use
4334 of the @samp{do ... while (0)} idiom to give the macro call C statement
4335 semantics. The full definition is equally idiomatic:
4336
4337 @example
4338 #define MAYBE_<TYP>METH(ty, m, args) do @{ \
4339 Lisp_<Type> *maybe_<typ>meth_ty = (ty); \
4340 if (HAS_<TYP>METH_P (maybe_<typ>meth_ty, m)) \
4341 <TYP>METH (maybe_<typ>meth_ty, m, args); \
4342 @} while (0)
4343 @end example
4344 @end table
4345
4346 The use of macros for invoking an object's methods makes life a bit
4347 difficult for the student or maintainer when browsing the code. In
4348 particular, calls are of the form @samp{<TYP>METH (ty, some_method, (x,
4349 y))}, but definitions typically are for @samp{<subtype>_some_method}.
4350 Thus, when you are trying to find calls, you need to grep for
4351 @samp{some_method}, but this will also catch calls and definitions of
4352 that method for instances of other subtypes of @samp{<Type>}, and there
4353 may be a rather large number of them.
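
To make the macro pattern concrete, here is a self-contained sketch of a
method table with two ``subtypes''. All names (@code{toy_device},
@code{TOYMETH}, and so on) are hypothetical and chosen so as not to be
confused with real XEmacs macros; only the shape of the macros follows
the table above.

```c
#include <stddef.h>

struct toy_device;

/* One slot per method; a NULL slot means "not implemented". */
struct toy_methods
{
  int (*width_method) (struct toy_device *);
  void (*beep_method) (struct toy_device *);
};

struct toy_device
{
  const struct toy_methods *methods;
  int beeps;
};

#define RAW_TOYMETH(d, m) ((d)->methods->m##_method)
#define HAS_TOYMETH_P(d, m) (!!RAW_TOYMETH (d, m))
#define TOYMETH(d, m, args) ((RAW_TOYMETH (d, m)) args)
#define MAYBE_TOYMETH(d, m, args) do {                  \
  struct toy_device *maybe_toymeth_d = (d);             \
  if (HAS_TOYMETH_P (maybe_toymeth_d, m))               \
    TOYMETH (maybe_toymeth_d, m, args);                 \
} while (0)

/* Definitions are named <subtype>_<method>, as described above. */
static int toy_tty_width (struct toy_device *d) { (void) d; return 80; }
static int toy_x_width (struct toy_device *d) { (void) d; return 132; }
static void toy_x_beep (struct toy_device *d) { d->beeps++; }

/* The tty subtype has no beep method; the x subtype has both. */
static const struct toy_methods toy_tty_methods = { toy_tty_width, NULL };
static const struct toy_methods toy_x_methods = { toy_x_width, toy_x_beep };
```

A call site then reads @samp{TOYMETH (d, width, (d))}, while the code
that actually runs is @code{toy_tty_width} or @code{toy_x_width}. This
is exactly the indirection that makes grepping for callers and
definitions awkward.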
4354
4355
4356 @node Writing Lisp Primitives, Writing Good Comments, Object-Oriented Techniques for C, Rules When Writing New C Code
4357 @section Writing Lisp Primitives
4358 @cindex writing Lisp primitives
4359 @cindex Lisp primitives, writing
4360 @cindex primitives, writing Lisp
4361
4362 Lisp primitives are Lisp functions implemented in C. The details of
4363 interfacing the C function so that Lisp can call it are handled by a few
4364 C macros. The only way to really understand how to write new C code is
4365 to read the source, but we can explain some things here.
4366
4367 An example of a special form is the definition of @code{prog1}, from
4368 @file{eval.c}. (An ordinary function would have the same general
4369 appearance.)
4370
4371 @cindex garbage collection protection
4372 @smallexample
4373 @group
4374 DEFUN ("prog1", Fprog1, 1, UNEVALLED, 0, /*
4375 Similar to `progn', but the value of the first form is returned.
4376 \(prog1 FIRST BODY...): All the arguments are evaluated sequentially.
4377 The value of FIRST is saved during evaluation of the remaining args,
4378 whose values are discarded.
4379 */
4380 (args))
4381 @{
4382 /* This function can GC */
4383 REGISTER Lisp_Object val, form, tail;
4384 struct gcpro gcpro1;
4385
4386 val = Feval (XCAR (args));
4387
4388 GCPRO1 (val);
4389
4390 LIST_LOOP_3 (form, XCDR (args), tail)
4391 Feval (form);
4392
4393 UNGCPRO;
4394 return val;
4395 @}
4396 @end group
4397 @end smallexample
4398
4399 Let's start with a precise explanation of the arguments to the
4400 @code{DEFUN} macro. Here is a template for them:
4401
4402 @example
4403 @group
4404 DEFUN (@var{lname}, @var{fname}, @var{min_args}, @var{max_args}, @var{interactive}, /*
4405 @var{docstring}
4406 */
4407 (@var{arglist}))
4408 @end group
4409 @end example
4410
4411 @table @var
4412 @item lname
4413 This string is the name of the Lisp symbol to define as the function
4414 name; in the example above, it is @code{"prog1"}.
4415
4416 @item fname
4417 This is the C function name for this function. This is the name that is
4418 used in C code for calling the function. The name is, by convention,
4419 @samp{F} prepended to the Lisp name, with all dashes (@samp{-}) in the
4420 Lisp name changed to underscores. Thus, to call this function from C
4421 code, call @code{Fprog1}. Remember that the arguments are of type
4422 @code{Lisp_Object}; various macros and functions for creating values of
4423 type @code{Lisp_Object} are declared in the file @file{lisp.h}.
4424
4425 Primitives whose names are special characters (e.g. @code{+} or
4426 @code{<}) are named by spelling out, in some fashion, the special
4427 character: e.g. @code{Fplus()} or @code{Flss()}. Primitives whose names
4428 begin with normal alphanumeric characters but also contain special
4429 characters are spelled out in some creative way, e.g. @code{let*}
4430 becomes @code{FletX()}.
4431
4432 Each function also has an associated structure that holds the data for
4433 the subr object that represents the function in Lisp. This structure
4434 conveys the Lisp symbol name to the initialization routine that will
4435 create the symbol and store the subr object as its definition. The C
4436 variable name of this structure is always @samp{S} prepended to the
4437 @var{fname}. You hardly ever need to be aware of the existence of this
4438 structure, since @code{DEFUN} plus @code{DEFSUBR} takes care of all the
4439 details.
4440
4441 @item min_args
4442 This is the minimum number of arguments that the function requires. The
4443 function @code{prog1} allows a minimum of one argument.
4444
4445 @item max_args
4446 This is the maximum number of arguments that the function accepts, if
4447 there is a fixed maximum. Alternatively, it can be @code{UNEVALLED},
4448 indicating a special form that receives unevaluated arguments, or
4449 @code{MANY}, indicating an unlimited number of evaluated arguments (the
4450 C equivalent of @code{&rest}). Both @code{UNEVALLED} and @code{MANY}
4451 are macros. If @var{max_args} is a number, it may not be less than
4452 @var{min_args} and it may not be greater than 8. (If you need to add a
4453 function with more than 8 arguments, use the @code{MANY} form. Resist
4454 the urge to edit the definition of @code{DEFUN} in @file{lisp.h}. If
4455 you do it anyway, make sure to also add another clause to the switch
4456 statement in @code{primitive_funcall()}.)
4457
4458 @item interactive
4459 This is an interactive specification, a string such as might be used as
4460 the argument of @code{interactive} in a Lisp function. In the case of
4461 @code{prog1}, it is 0 (a null pointer), indicating that @code{prog1}
4462 cannot be called interactively. A value of @code{""} indicates a
4463 function that should receive no arguments when called interactively.
4464
4465 @item docstring
4466 This is the documentation string. It is written just like a
4467 documentation string for a function defined in Lisp; in particular, the
4468 first line should be a single sentence. Note how the documentation
4469 string is enclosed in a comment, none of the documentation is placed on
4470 the same lines as the comment-start and comment-end characters, and the
4471 comment-start characters are on the same line as the interactive
4472 specification. @file{make-docfile}, which scans the C files for
4473 documentation strings, is very particular about what it looks for, and
4474 will not properly extract the doc string if it's not in this exact format.
4475
4476 In order to make both @file{etags} and @file{make-docfile} happy, make
4477 sure that the @code{DEFUN} line contains the @var{lname} and
4478 @var{fname}, and that the comment-start characters for the doc string
4479 are on the same line as the interactive specification, and put a newline
4480 directly after them (and before the comment-end characters).
4481
4482 @item arglist
4483 This is the comma-separated list of arguments to the C function. For a
4484 function with a fixed maximum number of arguments, provide a C argument
4485 for each Lisp argument. In this case, unlike regular C functions, the
4486 types of the arguments are not declared; they are simply always of type
4487 @code{Lisp_Object}.
4488
4489 The names of the C arguments will be used as the names of the arguments
4490 to the Lisp primitive as displayed in its documentation, modulo the same
4491 concerns described above for @code{F...} names (in particular,
4492 underscores in the C arguments become dashes in the Lisp arguments).
4493
4494 There is one additional kludge: A trailing @samp{_} on the C argument is
4495 discarded when forming the Lisp argument. This allows C language
4496 reserved words (like @code{default}) or global symbols (like
4497 @code{dirname}) to be used as argument names without compiler warnings
4498 or errors.
4499
4500 A Lisp function with @w{@var{max_args} = @code{UNEVALLED}} is a
4501 @w{@dfn{special form}}; its arguments are not evaluated. Instead it
4502 receives one argument of type @code{Lisp_Object}, a (Lisp) list of the
4503 unevaluated arguments, conventionally named @code{(args)}.
4504
4505 When a Lisp function has no upper limit on the number of arguments,
4506 specify @w{@var{max_args} = @code{MANY}}. In this case its implementation in
4507 C actually receives exactly two arguments: the number of Lisp arguments
4508 (an @code{int}) and the address of a block containing their values (a
4509 @w{@code{Lisp_Object *}}). In this case only are the C types specified
4510 in the @var{arglist}: @w{@code{(int nargs, Lisp_Object *args)}}.
4511
4512 @end table
4513
4514 Within the function @code{Fprog1} itself, note the use of the macros
4515 @code{GCPRO1} and @code{UNGCPRO}. @code{GCPRO1} is used to ``protect''
4516 a variable from garbage collection---to inform the garbage collector
4517 that it must look in that variable and regard the object pointed at by
4518 its contents as an accessible object. This is necessary whenever you
4519 call @code{Feval} or anything that can directly or indirectly call
4520 @code{Feval} (this includes the @code{QUIT} macro!). At such a time,
4521 any Lisp object that you intend to refer to again must be protected
4522 somehow. @code{UNGCPRO} cancels the protection of the variables that
4523 are protected in the current function. It is necessary to do this
4524 explicitly.
4525
4526 The macro @code{GCPRO1} protects just one local variable. If you want
4527 to protect two, use @code{GCPRO2} instead; repeating @code{GCPRO1} will
4528 not work. Macros @code{GCPRO3} and @code{GCPRO4} also exist.
4529
4530 These macros implicitly use local variables such as @code{gcpro1}; you
4531 must declare these explicitly, with type @code{struct gcpro}. Thus, if
4532 you use @code{GCPRO2}, you must declare @code{gcpro1} and @code{gcpro2}.
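
To see why the @code{gcproN} locals must be declared, why repeating
@code{GCPRO1} cannot work, and why @code{UNGCPRO} has to be explicit, it
may help to look at a toy model. The sketch below assumes the protection
records form a linked list of stack-allocated structures headed by a global
pointer; the @samp{TOY_} and @samp{toy_} names are invented here, and the
real definitions in @file{lisp.h} may differ in detail.

```c
typedef long Lisp_Object;   /* stand-in for the real type */

struct gcpro
{
  struct gcpro *next;   /* previously registered record */
  Lisp_Object *var;     /* address of the protected variable */
  int nvars;            /* number of objects starting at VAR */
};

static struct gcpro *toy_gcprolist;   /* head of the chain of records */

/* Both macros silently refer to locals named gcpro1 (and gcpro2). */
#define TOY_GCPRO1(v)							\
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v), gcpro1.nvars = 1,	\
   toy_gcprolist = &gcpro1)
#define TOY_GCPRO2(v1, v2)						\
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v1), gcpro1.nvars = 1,	\
   gcpro2.next = &gcpro1, gcpro2.var = &(v2), gcpro2.nvars = 1,	\
   toy_gcprolist = &gcpro2)
/* Unlink everything this function registered. */
#define TOY_UNGCPRO (toy_gcprolist = gcpro1.next)

/* What a collector would do: walk the chain and visit every record. */
static int
toy_count_protected (void)
{
  int n = 0;
  struct gcpro *g;
  for (g = toy_gcprolist; g; g = g->next)
    n += g->nvars;
  return n;
}

int
toy_gcpro_demo (void)
{
  Lisp_Object a = 1, b = 2;
  struct gcpro gcpro1, gcpro2;
  int base = toy_count_protected ();
  int during, after;

  TOY_GCPRO2 (a, b);
  during = toy_count_protected () - base;   /* both variables are visible */
  TOY_UNGCPRO;
  after = toy_count_protected () - base;    /* chain restored */
  return during * 10 + after;
}
```

In this model a second @code{TOY_GCPRO1} in the same function would
overwrite the one @code{gcpro1} record instead of adding a new one, and a
function that returned without @code{TOY_UNGCPRO} would leave the global
chain pointing into a dead stack frame.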
4533
4534 @cindex caller-protects (@code{GCPRO} rule)
4535 Note also that the general rule is @dfn{caller-protects}; i.e. you are
4536 only responsible for protecting those Lisp objects that you create. Any
4537 objects passed to you as arguments should have been protected by whoever
4538 created them, so you don't in general have to protect them.
4539
4540 In particular, the arguments to any Lisp primitive are always
4541 automatically @code{GCPRO}ed, when called ``normally'' from Lisp code or
4542 bytecode. So only a few Lisp primitives that are called frequently from
4543 C code, such as @code{Fprogn}, protect their arguments as a service to
4544 their caller. You don't need to protect your arguments when writing a
4545 new @code{DEFUN}.
4546
4547 @code{GCPRO}ing is perhaps the trickiest and most error-prone part of
4548 XEmacs coding. It is @strong{extremely} important that you get this
4549 right and use a great deal of discipline when writing this code.
4550 @xref{GCPROing, ,@code{GCPRO}ing}, for full details on how to do this.
4551
4552 What @code{DEFUN} actually does is declare a global structure of type
4553 @code{Lisp_Subr} whose name begins with capital @samp{SF} and which
4554 contains information about the primitive (e.g. a pointer to the
4555 function, its minimum and maximum allowed arguments, a string describing
4556 its Lisp name); @code{DEFUN} then begins a normal C function declaration
4557 using the @code{F...} name. The Lisp subr object that is the function
4558 definition of a primitive (i.e. the object in the function slot of the
4559 symbol that names the primitive) actually points to this @samp{SF}
4560 structure; when @code{Feval} encounters a subr, it looks in the
4561 structure to find out how to call the C function.
4562
4563 Defining the C function is not enough to make a Lisp primitive
4564 available; you must also create the Lisp symbol for the primitive (the
4565 symbol is @dfn{interned}; @pxref{Obarrays}) and store a suitable subr
4566 object in its function cell. (If you don't do this, the primitive won't
4567 be seen by Lisp code.) The code looks like this:
4568
4569 @example
4570 DEFSUBR (@var{fname});
4571 @end example
4572
4573 @noindent
4574 Here @var{fname} is the same name you used as the second argument to
4575 @code{DEFUN}.
4576
4577 This call to @code{DEFSUBR} should go in the @code{syms_of_*()} function
4578 at the end of the module. If no such function exists, create it and
4579 make sure to also declare it in @file{symsinit.h} and call it from the
4580 appropriate spot in @code{main()}. @xref{General Coding Rules}.
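
The division of labor between @code{DEFUN} and @code{DEFSUBR} can be
sketched in plain C. Everything below is a simplified stand-in written for
this illustration (@code{TOY_DEFUN1}, @code{TOY_DEFSUBR}, @code{toy_subr}
and the fixed-size table are assumptions, not the XEmacs definitions): the
first macro declares an @samp{S...} structure and opens the @samp{F...}
function, and the second makes the structure findable by its Lisp name.

```c
#include <stddef.h>
#include <string.h>

typedef long Lisp_Object;   /* stand-in for the real type */

struct toy_subr
{
  const char *name;                  /* Lisp-visible name */
  int min_args, max_args;
  Lisp_Object (*fn) (Lisp_Object);   /* the C implementation */
};

/* Declare the S... structure, then open the F... function definition. */
#define TOY_DEFUN1(lname, Fname, arg)				\
  static Lisp_Object Fname (Lisp_Object);			\
  static struct toy_subr S##Fname = { lname, 1, 1, Fname };	\
  static Lisp_Object Fname (Lisp_Object arg)

/* Stand-in for interning: a small table of registered primitives. */
static struct toy_subr *toy_obarray[8];
static int toy_obarray_count;

#define TOY_DEFSUBR(Fname) (toy_obarray[toy_obarray_count++] = &S##Fname)

TOY_DEFUN1 ("toy-add1", Ftoy_add1, n)
{
  return n + 1;
}

/* Plays the role of a syms_of_*() function. */
void
syms_of_toy (void)
{
  TOY_DEFSUBR (Ftoy_add1);
}

/* Look a primitive up by its Lisp name; NULL if it was never registered. */
struct toy_subr *
toy_find_subr (const char *name)
{
  int i;
  for (i = 0; i < toy_obarray_count; i++)
    if (!strcmp (toy_obarray[i]->name, name))
      return toy_obarray[i];
  return NULL;
}
```

Until @code{syms_of_toy} has run, the function exists in C but cannot be
found by name, which mirrors a primitive whose @code{DEFSUBR} was forgotten.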
4581
4582 Note that C code cannot call functions by name unless they are defined
4583 in C. The way to call a function written in Lisp from C is to use
4584 @code{Ffuncall}, which embodies the Lisp function @code{funcall}. Since
4585 the Lisp function @code{funcall} accepts an unlimited number of
4586 arguments, in C it takes two: the number of Lisp-level arguments, and a
4587 one-dimensional array containing their values. The first Lisp-level
4588 argument is the Lisp function to call, and the rest are the arguments to
4589 pass to it. Since @code{Ffuncall} can call the evaluator, you must
4590 protect pointers from garbage collection around the call to
4591 @code{Ffuncall}. (However, @code{Ffuncall} explicitly protects all of
4592 its parameters, so you don't have to protect any pointers passed as
4593 parameters to it.)
4594
4595 The C functions @code{call0}, @code{call1}, @code{call2}, and so on,
4596 provide handy ways to call a Lisp function conveniently with a fixed
4597 number of arguments. They work by calling @code{Ffuncall}.
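
The calling convention shared by @code{MANY} primitives and
@code{Ffuncall} (a count plus a block of values, with the function itself
in the first slot) can be modelled without any XEmacs headers. In this
sketch @code{toy_obj} is a deliberately crude stand-in for
@code{Lisp_Object}, and @code{toy_funcall} and @code{toy_call2} are
hypothetical analogues of @code{Ffuncall} and @code{call2}; garbage
collection is ignored entirely.

```c
typedef struct toy_obj toy_obj;
typedef toy_obj (*toy_many_fn) (int nargs, toy_obj *args);

/* Toy stand-in for Lisp_Object: either a number or a callable. */
struct toy_obj
{
  long num;
  toy_many_fn fn;
};

static toy_obj
toy_make_num (long n)
{
  toy_obj o = { n, 0 };
  return o;
}

static toy_obj
toy_make_fn (toy_many_fn fn)
{
  toy_obj o = { 0, fn };
  return o;
}

/* A MANY-style primitive: receives a count and a block of values. */
static toy_obj
toy_plus (int nargs, toy_obj *args)
{
  long sum = 0;
  int i;
  for (i = 0; i < nargs; i++)
    sum += args[i].num;
  return toy_make_num (sum);
}

/* args[0] is the function; the other nargs - 1 slots are its arguments. */
static toy_obj
toy_funcall (int nargs, toy_obj *args)
{
  return args[0].fn (nargs - 1, args + 1);
}

/* Fixed-arity convenience wrapper: pack the array, then funcall. */
static toy_obj
toy_call2 (toy_obj fn, toy_obj a, toy_obj b)
{
  toy_obj args[3];
  args[0] = fn;
  args[1] = a;
  args[2] = b;
  return toy_funcall (3, args);
}

long
toy_call2_demo (void)
{
  return toy_call2 (toy_make_fn (toy_plus),
		    toy_make_num (2), toy_make_num (3)).num;
}
```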
4598
4599 @file{eval.c} is a very good file to look through for examples;
4600 @file{lisp.h} contains the definitions for important macros and
4601 functions.
4602
4603 @node Writing Good Comments, Adding Global Lisp Variables, Writing Lisp Primitives, Rules When Writing New C Code
4604 @section Writing Good Comments
4605 @cindex writing good comments
4606 @cindex comments, writing good
4607
4608 Comments are a lifeline for programmers trying to understand tricky
4609 code. In general, the less obvious it is what you are doing, the more
4610 you need a comment, and the more detailed it needs to be. You should
4611 always be on guard when you're writing code for stuff that's tricky, and
4612 should constantly be putting yourself in someone else's shoes and asking
4613 if that person could figure out without much difficulty what's going
4614 on. (Assume they are a competent programmer who understands the
4615 essentials of how the XEmacs code is structured but doesn't know much
4616 about the module you're working on or any algorithms you're using.) If
4617 you're not sure whether they would be able to, add a comment. Always
4618 err on the side of more comments, rather than fewer.
4619
4620 Generally, when making comments, there is no need to attribute them with
4621 your name or initials. This especially goes for small,
4622 easy-to-understand, non-opinionated ones. Also, comments indicating
4623 where, when, and by whom a file was changed are @emph{strongly}
4624 discouraged, and in general will be removed as they are discovered.
4625 This is exactly what @file{ChangeLogs} are there for. However, it can
4626 occasionally be useful to mark exactly where (but not when or by whom)
4627 changes are made, particularly when making small changes to a file
4628 imported from elsewhere. These marks help when later on a newer version
4629 of the file is imported and the changes need to be merged. (If
4630 everything were always kept in CVS, there would be no need for this.
4631 But in practice, this often doesn't happen, or the CVS repository is
4632 later on lost or unavailable to the person doing the update.)
4633
4634 When putting in an explicit opinion in a comment, you should
4635 @emph{always} attribute it with your name and the date. This also goes
4636 for long, complex comments explaining in detail the workings of
4637 something -- by putting your name there, you make it possible for
4638 someone who has questions about how that thing works to determine who
4639 wrote the comment so they can write to them. Use your actual name or
4640 your alias at xemacs.org, and not your initials or nickname, unless that
4641 is generally recognized (e.g. @samp{jwz}). Even then, please consider
4642 requesting a virtual user at xemacs.org (forwarding address; we can't
4643 provide an actual mailbox). Otherwise, give first and last name. If
4644 you're not a regular contributor, you might consider putting your email
4645 address in -- it may be in the ChangeLog, but after a while ChangeLogs
4646 have a tendency of disappearing or getting muddled. (E.g. your comment
4647 may get copied somewhere else or even into another program, and tracking
4648 down the proper ChangeLog may be very difficult.)
4649
4650 If you come across an opinion that is not or is no longer valid, or you
4651 come across any comment that no longer applies but you want to keep it
4652 around, enclose it in @samp{[[ } and @samp{ ]]} marks and add a comment
4653 afterwards explaining why the preceding comment is no longer valid. Put
4654 your name on this comment, as explained above.
4655
4656 Just as comments are a lifeline to programmers, incorrect comments are
4657 death. If you come across an incorrect comment, @strong{immediately}
4658 correct it or flag it as incorrect, as described in the previous
4659 paragraph. Whenever you work on a section of code, @emph{always} make
4660 sure to update any comments to be correct -- or, at the very least, flag
4661 them as incorrect.
4662
4663 To indicate a ``todo'' or other problem, use four pound signs --
4664 i.e. @samp{####}.
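
Put together, the conventions above might look like the following in
practice. The function, the two author names and the dates are placeholders
invented for this illustration.

```c
/* Return the largest element of V, which has N > 0 elements. */
int
toy_max (const int *v, int n)
{
  int best = v[0];
  int i;

  /* [[ We scan from the end because the largest values tend to be
     appended last.  --Jane Hacker, 2004-11-04 ]]
     The scan now runs front to back and never exits early, so the
     rationale above no longer applies.  --John Coder, 2004-11-05 */
  for (i = 1; i < n; i++)
    if (v[i] > best)
      best = v[i];

  /* #### Does not handle n == 0; callers must check. */
  return best;
}
```

The obsolete opinion stays readable inside the @samp{[[ ]]} marks, the
signed follow-up says why it is obsolete, and the @samp{####} marker makes
the open problem easy to find with a search.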
4665
4666 @node Adding Global Lisp Variables, Writing Macros, Writing Good Comments, Rules When Writing New C Code
4667 @section Adding Global Lisp Variables
4668 @cindex global Lisp variables, adding
4669 @cindex variables, adding global Lisp
4670
4671 Global variables whose names begin with @samp{Q} are constants whose
4672 value is a symbol of a particular name. The name of the variable should
4673 be derived from the name of the symbol using the same rules as for Lisp
4674 primitives. These variables are initialized using a call to
4675 @code{defsymbol()} in the @code{syms_of_*()} function. (This call
4676 interns a symbol, sets the C variable to the resulting Lisp object, and
4677 calls @code{staticpro()} on the C variable to tell the
4678 garbage-collection mechanism about this variable. What
4679 @code{staticpro()} does is add a pointer to the variable to a large
4680 global array; when garbage-collection happens, all pointers listed in
4681 the array are used as starting points for marking Lisp objects. This is
4682 important because it's quite possible that the only current reference to
4683 the object is the C variable. In the case of symbols, the
4684 @code{staticpro()} doesn't matter all that much because the symbol is
4685 contained in @code{obarray}, which is itself @code{staticpro()}ed.
4686 However, it's possible that a naughty user could do something like
4687 uninterning the symbol out of @code{obarray} or even setting
4688 @code{obarray} to a different value [although this is likely to make
4689 XEmacs crash!].)
4690
4691 @strong{Please note:} It is potentially deadly if you declare a
4692 @samp{Q...} variable in two different modules. The two calls to
4693 @code{defsymbol()} are no problem, but some linkers will complain about
4694 multiply-defined symbols. The most insidious aspect of this is that
4695 often the link will succeed anyway, but then the resulting executable
4696 will sometimes crash in obscure ways during certain operations!
4697
4698 To avoid this problem, declare any symbols with common names (such as
4699 @code{text}) that are not obviously associated with this particular
4700 module in the file @file{general-slots.h}. The ``-slots'' suffix
4701 indicates that this is a file that is included multiple times in
4702 @file{general.c}. Redefinition of preprocessor macros allows the
4703 effects to be different in each context, so this is actually more
4704 convenient and less error-prone than doing it in your module.
4705
4706 Global variables whose names begin with @samp{V} are variables that
4707 contain Lisp objects. The convention here is that all global variables
4708 of type @code{Lisp_Object} begin with @samp{V}, and all others don't
4709 (including integer and boolean variables that have Lisp
4710 equivalents). Most of the time, these variables have equivalents in
4711 Lisp, but some don't. Those that do are declared this way by a call to
4712 @code{DEFVAR_LISP()} in the @code{vars_of_*()} initializer for the
4713 module. What this does is create a special @dfn{symbol-value-forward}
4714 Lisp object that contains a pointer to the C variable, intern a symbol
4715 whose name is as specified in the call to @code{DEFVAR_LISP()}, and set
4716 its value to the symbol-value-forward Lisp object; it also calls
4717 @code{staticpro()} on the C variable to tell the garbage-collection
4718 mechanism about the variable. When @code{eval} (or actually
4719 @code{symbol-value}) encounters this special object in the process of
4720 retrieving a variable's value, it follows the indirection to the C
4721 variable and gets its value. @code{setq} does similar things so that
4722 the C variable gets changed.
4723
4724 Whether or not you @code{DEFVAR_LISP()} a variable, you need to
4725 initialize it in the @code{vars_of_*()} function; otherwise it will end
4726 up as all zeroes, which is the integer 0 (@emph{not} @code{nil}), and
4727 this is probably not what you want. Also, if the variable is not
4728 @code{DEFVAR_LISP()}ed, @strong{you must call} @code{staticpro()} on the
4729 C variable in the @code{vars_of_*()} function. Otherwise, the
4730 garbage-collection mechanism won't know that the object in this variable
4731 is in use, and will happily collect it and reuse its storage for another
4732 Lisp object, and you will be the one who's unhappy when you can't figure
4733 out how your variable got overwritten.
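
The role of @code{staticpro()} can be illustrated with a toy root table.
The sketch assumes the simplest possible model (a fixed array of variable
addresses and a ``mark phase'' that merely checks the roots); the
@samp{toy_} and @samp{Vtoy_} names are invented and the real collector is
of course far more involved.

```c
typedef long Lisp_Object;   /* stand-in for the real type */

#define TOY_MAX_ROOTS 16
static Lisp_Object *toy_roots[TOY_MAX_ROOTS];
static int toy_nroots;

/* Record the ADDRESS of the variable, so later assignments are seen. */
void
toy_staticpro (Lisp_Object *varaddr)
{
  if (toy_nroots < TOY_MAX_ROOTS)
    toy_roots[toy_nroots++] = varaddr;
}

/* Stand-in for marking: is OBJ referenced by some registered root? */
int
toy_reachable_p (Lisp_Object obj)
{
  int i;
  for (i = 0; i < toy_nroots; i++)
    if (*toy_roots[i] == obj)
      return 1;
  return 0;
}

Lisp_Object Vtoy_registered;   /* passed to toy_staticpro below */
Lisp_Object Vtoy_forgotten;    /* never registered: invisible to marking */

/* Plays the role of a vars_of_*() function. */
void
vars_of_toy (void)
{
  Vtoy_registered = 0;
  Vtoy_forgotten = 0;
  toy_staticpro (&Vtoy_registered);
}
```

An object stored only in @code{Vtoy_forgotten} is never found by
@code{toy_reachable_p}, which is the situation in which a real collector
would reclaim it while the C variable still refers to it.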
4734
4735 @node Writing Macros, Proper Use of Unsigned Types, Adding Global Lisp Variables, Rules When Writing New C Code
4736 @section Writing Macros
4737 @cindex writing macros
4738 @cindex macros, writing
4739
4740 The three golden rules of macros:
4741
4742 @enumerate
4743 @item
4744 Anything that's an lvalue can be evaluated more than once.
4745 @item
4746 Macros where anything else can be evaluated more than once should
4747 have the word ``unsafe'' in their name (exceptions may be made for
4748 large sets of macros that evaluate arguments of certain types more
4749 than once, e.g. @code{struct buffer *} arguments, when clearly indicated in
4750 the macro documentation). These macros are generally meant to be
4751 called only by other macros that have already stored the calling
4752 values in temporary variables.
4753 @item
4754 Nothing else can be evaluated more than once. Use inline
4755 functions, if necessary, to prevent multiple evaluation.
4756 @end enumerate
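
A small experiment makes rules 2 and 3 concrete. The names below are made
up for the illustration; the point is only that an argument with a side
effect is evaluated twice by the macro and once by the inline function.

```c
static int toy_calls;   /* counts how often the argument is evaluated */

static int
toy_next (void)
{
  return ++toy_calls;
}

/* Evaluates X twice, so it carries "unsafe" in its name (rule 2). */
#define TOY_SQUARE_UNSAFE(x) ((x) * (x))

/* Function semantics: the argument is evaluated exactly once (rule 3). */
static inline int
toy_square (int x)
{
  return x * x;
}

/* Number of times the argument expression ran under the macro. */
int
toy_evaluations_unsafe (void)
{
  toy_calls = 0;
  (void) TOY_SQUARE_UNSAFE (toy_next ());
  return toy_calls;
}

/* Number of times the argument expression ran under the function. */
int
toy_evaluations_safe (void)
{
  toy_calls = 0;
  (void) toy_square (toy_next ());
  return toy_calls;
}
```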
4757
4758 NOTE: The functions and macros below are given full prototypes in their
4759 docs, even when the implementation is a macro. In such cases, passing
4760 an argument of a type other than expected will produce undefined
4761 results. Also, given that macros can do things functions can't (in
4762 particular, directly modify arguments as if they were passed by
4763 reference), the declaration syntax has been extended to include the
4764 call-by-reference syntax from C++, where an & after a type indicates
4765 that the argument is an lvalue and is passed by reference, i.e. the
4766 function can modify its value. (This is equivalent in C to passing a
4767 pointer to the argument, but without the need to explicitly worry about
4768 pointers.)
4769
4770 When to capitalize macros:
4771
4772 @itemize @bullet
4773 @item
4774 Capitalize macros doing stuff obviously impossible with (C)
4775 functions, e.g. directly modifying arguments as if they were passed by
4776 reference.
4777 @item
4778 Capitalize macros that evaluate @strong{any} argument more than once, regardless
4779 of whether that's ``allowed'' (e.g. buffer arguments).
4780 @item
4781 Capitalize macros that directly access a field in a @code{Lisp_Object} or
4782 its equivalent underlying structure. In such cases, the macro that accesses
4783 the field through the @code{Lisp_Object} has an @samp{X} prepended to its name,
4784 and the macro that accesses it through the underlying structure doesn't.
4785 @item
4786 Capitalize certain other basic macros relating to @code{Lisp_Object}s; e.g.
4787 @code{FRAMEP}, @code{CHECK_FRAME}, etc.
4788 @item
4789 Try to avoid capitalizing any other macros.
4790 @end itemize
4791
4792 @node Proper Use of Unsigned Types, Techniques for XEmacs Developers, Writing Macros, Rules When Writing New C Code
4793 @section Proper Use of Unsigned Types
4794 @cindex unsigned types, proper use of
4795 @cindex types, proper use of unsigned
4796
4797 Avoid using @code{unsigned int} and @code{unsigned long} whenever
4798 possible. Unsigned types are viral -- any arithmetic or comparison
4799 involving mixed signed and unsigned operands is automatically converted to
4800 unsigned, which is almost certainly not what you want. Many subtle and
4801 hard-to-find bugs are created by careless use of unsigned types. In
4802 general, you should almost @emph{never} use an unsigned type to hold a
4803 regular quantity of any sort. The only exceptions are
4804
4805 @enumerate
4806 @item
4807 When there's a reasonable possibility you will actually need all 32 or
4808 64 bits to store the quantity.
4809 @item
4810 When calling existing APIs that require unsigned types. In this case,
4811 you should still do all manipulation using signed types, and do the
4812 conversion at the very threshold of the API call.
4813 @item
4814 In existing code that you don't want to modify because you don't
4815 maintain it.
4816 @item
4817 In bit-field structures.
4818 @end enumerate
4819
4820 Other reasonable uses of @code{unsigned int} and @code{unsigned long}
4821 are representing non-quantities -- e.g. bit-oriented flags and such.
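
The ``viral'' conversion, and the advice to convert only at the threshold
of an API call, can both be seen in a few lines. The function names are
invented for this sketch; @code{memcpy} stands in for any existing API
that takes an unsigned size.

```c
#include <stddef.h>
#include <string.h>

/* A is converted to unsigned before the comparison, so a negative A
   becomes a huge positive value and compares greater than small B. */
int
toy_mixed_less_p (int a, unsigned int b)
{
  return a < b;
}

/* Keeping both operands signed gives the expected answer. */
int
toy_signed_less_p (int a, int b)
{
  return a < b;
}

/* Do the arithmetic signed; convert only in the call that needs size_t. */
int
toy_copy_tail (char *dst, const char *src, int len, int skip)
{
  int remaining = len - skip;   /* may legitimately go negative */
  if (remaining <= 0)
    return 0;
  memcpy (dst, src + skip, (size_t) remaining);
  return remaining;
}
```

Had @code{remaining} been unsigned, the @samp{len - skip} subtraction
would wrap around for @samp{skip > len} and the guard would never fire.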
4822
4823 @node Techniques for XEmacs Developers, , Proper Use of Unsigned Types, Rules When Writing New C Code
4824 @section Techniques for XEmacs Developers
4825 @cindex techniques for XEmacs developers
4826 @cindex developers, techniques for XEmacs
4827
4828 @cindex Purify
4829 @cindex Quantify
4830 To make a purified XEmacs, do: @code{make puremacs}.
4831 To make a quantified XEmacs, do: @code{make quantmacs}.
4832
4833 You simply can't dump Quantified and Purified images (unless using the
4834 portable dumper). Purify gets confused when XEmacs frees memory in one
4835 process that was allocated in a @emph{different} process on a different
4836 machine! Run it like so:
4837 @example
4838 temacs -batch -l loadup.el run-temacs @var{xemacs-args...}
4839 @end example
4840
4841 @cindex error checking
4842 Before you go through the trouble, are you compiling with all
4843 debugging and error-checking off? If not, try that first. Be warned
4844 that while Quantify is directly responsible for quite a few
4845 optimizations which have been made to XEmacs, doing a run which
4846 generates results which can be acted upon is not necessarily a trivial
4847 task.
4848
4849 Also, if you're still willing to do some runs, make sure you configure
4850 with the @samp{--quantify} flag. That will keep Quantify from starting
4851 to record data until after the loadup is completed and will shut off
4852 recording right before it shuts down (which generates enough bogus data
4853 to throw most results off). It also enables three additional elisp
4854 commands: @code{quantify-start-recording-data},
4855 @code{quantify-stop-recording-data} and @code{quantify-clear-data}.
4856
4857 If you want to make XEmacs faster, target your favorite slow benchmark,
4858 run a profiler like Quantify, @code{gprof}, or @code{tcov}, and figure
4859 out where the cycles are going. In many cases you can localize the
4860 problem (because a particular new feature or even a single patch
4861 elicited it). Don't hesitate to use brute force techniques like a
4862 global counter incremented at strategic places, especially in
4863 combination with other performance indications (@emph{e.g.}, degree of
4864 buffer fragmentation into extents).
4865
4866 Specific projects:
4867
4868 @itemize @bullet
4869 @item
4870 Make the garbage collector faster. Figure out how to write an
4871 incremental garbage collector.
4872 @item
4873 Write a compiler that takes bytecode and spits out C code.
4874 Unfortunately, you will then need a C compiler and a more fully
4875 developed module system.
4876 @item
4877 Speed up redisplay.
4878 @item
4879 Speed up syntax highlighting. It was suggested that ``maybe moving some
4880 of the syntax highlighting capabilities into C would make a
4881 difference.'' Wrong idea, I think. When processing one 400kB file a
4882 particular low-level routine was being called 40 @emph{million} times
4883 simply for @emph{one} call to @code{newline-and-indent}. Syntax
4884 highlighting needs to be rewritten to use a reliable, fast parser, then
4885 to trust the pre-parsed structure, and only do re-highlighting locally
4886 to a text change. Modern machines are fast enough to implement such
4887 parsers in Lisp; but no machine will ever be fast enough to deal with
4888 quadratic (or worse) algorithms!
4889 @item
4890 Implement tail recursion in Emacs Lisp (hard!).
4891 @end itemize
4892
4893 Unfortunately, Emacs Lisp is slow, and is going to stay slow. Function
4894 calls in elisp are especially expensive. Iterating over a long list is
4895 going to be 30 times faster implemented in C than in Elisp.
4896
4897 Heavily used small code fragments need to be fast. The traditional way
4898 to implement such code fragments in C is with macros. But macros in C
4899 are known to be broken.
4900
4901 @cindex macro hygiene
4902 Macro arguments that are repeatedly evaluated may suffer from repeated
4903 side effects or suboptimal performance.
4904
4905 Variable names used in macros may collide with caller's variables,
4906 causing (at least) unwanted compiler warnings.
4907
4908 In order to solve these problems, and maintain statement semantics, one
4909 should use the @code{do @{ ... @} while (0)} trick while trying to
4910 reference macro arguments exactly once using local variables.
4911
4912 Let's take a look at this poor macro definition:
4913
4914 @example
4915 #define MARK_OBJECT(obj) \
4916 if (!marked_p (obj)) mark_object (obj), did_mark = 1
4917 @end example
4918
4919 This macro evaluates its argument twice, and also misbehaves when used like this, because the @code{else} silently pairs with the @code{if} hidden inside the macro:
4920 @example
4921 if (flag) MARK_OBJECT (obj); else do_something ();
4922 @end example
4923
4924 A much better definition is
4925
4926 @example
4927 #define MARK_OBJECT(obj) do @{ \
4928 Lisp_Object mo_obj = (obj); \
4929 if (!marked_p (mo_obj)) \
4930 @{ \
4931 mark_object (mo_obj); \
4932 did_mark = 1; \
4933 @} \
4934 @} while (0)
4935 @end example
4936
4937 Notice the elimination of double evaluation by using the local variable
4938 with the obscure name. Writing safe and efficient macros requires great
4939 care. One problem with macros cannot be portably worked around: since a
4940 C block has no value, a macro used as an expression rather than a
4941 statement cannot use the techniques just described to avoid multiple
4942 evaluation.
4943
4944 @cindex inline functions
4945 In most cases where a macro has function semantics, an inline function
4946 is a better implementation technique. Modern compiler optimizers tend
4947 to inline functions even if they have no @code{inline} keyword, and
4948 configure magic ensures that the @code{inline} keyword can be safely
4949 used as an additional compiler hint. Inline functions used in a single
4950 @file{.c} file are easy. The function must already be defined to be
4951 @code{static}. Just add the @code{inline} keyword to the
4952 definition.
4953
4954 @example
4955 inline static int
4956 heavily_used_small_function (int arg)
4957 @{
4958 ...
4959 @}
4960 @end example
4961
4962 Inline functions in header files are trickier, because we would like to
4963 make the following optimization if the function is @emph{not} inlined
4964 (for example, because we're compiling for debugging). We would like the
4965 function to be defined externally exactly once, and each calling
4966 translation unit would create an external reference to the function,
4967 instead of including a definition of the inline function in the object
4968 code of every translation unit that uses it. This optimization is
4969 currently only available for gcc. But you don't have to worry about the
4970 trickiness; just define your inline functions in header files using this
4971 pattern:
4972
4973 @example
4974 DECLARE_INLINE_HEADER (
4975 int
4976 i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg)
4977 )
4978 @{
4979 ...
4980 @}
4981 @end example
4982
4983 We use @code{DECLARE_INLINE_HEADER} rather than just the modifier
4984 @code{INLINE_HEADER} to prevent warnings when compiling with @code{gcc
4985 -Wmissing-declarations}. I consider issuing this warning for inline
4986 functions a gcc bug, but the gcc maintainers disagree.
4987
4988 @cindex inline functions, headers
4989 @cindex header files, inline functions
4990 Every header which contains inline functions, either directly by using
4991 @code{DECLARE_INLINE_HEADER} or indirectly by using @code{DECLARE_LRECORD}, must
4992 be added to @file{inline.c}'s includes to make the optimization
4993 described above work. (Optimization note: if all @code{INLINE_HEADER}
4994 functions are in fact inlined in all translation units, then the linker
4995 can just discard @code{inline.o}, since it contains only unreferenced code.)
4996
4997 To get started debugging XEmacs, take a look at the @file{.gdbinit} and
4998 @file{.dbxrc} files in the @file{src} directory. See the section in the
4999 XEmacs FAQ on How to Debug an XEmacs problem with a debugger.
5000
5001 After making source code changes, run @code{make check} to ensure that
5002 you haven't introduced any regressions. If you want to make XEmacs more
5003 reliable, please improve the test suite in @file{tests/automated}.
5004
5005 Did you make sure you didn't introduce any new compiler warnings?
5006
5007 Before submitting a patch, please try compiling at least once with
5008
5009 @example
5010 configure --with-mule --use-union-type --error-checking=all
5011 @end example
5012
5013 Here are things to know when you create a new source file:
5014
5015 @itemize @bullet
5016 @item
5017 All @file{.c} files should @code{#include <config.h>} first. Almost all
5018 @file{.c} files should @code{#include "lisp.h"} second.
5019
5020 @item
5021 Generated header files should be included using the @samp{#include <...>}
5022 syntax, not the @samp{#include "..."} syntax. The generated headers are:
5023
5024 @file{config.h sheap-adjust.h paths.h Emacs.ad.h}
5025
5026 The basic rule is that you should assume builds using @samp{--srcdir},
5027 and that the @samp{#include <...>} syntax needs to be used when the
5028 to-be-included generated file is in a potentially different directory
5029 @emph{at compile time}. The non-obvious C rule is that
5030 @samp{#include "..."} means to search for the included file in the same
5031 directory as the including file, @emph{not} in the current directory.
5032 Normally this is not a problem but when building with @samp{--srcdir},
5033 @file{make} will search the @samp{VPATH} for you, while the C compiler
5034 knows nothing about it.
5035
5036 @item
5037 Header files should @emph{not} include @samp{<config.h>} or
5038 @samp{"lisp.h"}. It is the responsibility of the @file{.c} files that
5039 use them to do so.
5040
5041 @end itemize
5042
5043 @cindex Lisp object types, creating
5044 @cindex creating Lisp object types
5045 @cindex object types, creating Lisp
5046 Here is a checklist of things to do when creating a new lisp object type
5047 named @var{foo}:
5048
5049 @enumerate
5050 @item
5051 create @file{@var{foo}.h}
5052 @item
5053 create @file{@var{foo}.c}
5054 @item
5055 add definitions of @code{syms_of_@var{foo}}, etc. to @file{@var{foo}.c}
5056 @item
5057 add declarations of @code{syms_of_@var{foo}}, etc. to @file{symsinit.h}
5058 @item
5059 add calls to @code{syms_of_@var{foo}}, etc. to @file{emacs.c}
5060 @item
5061 add definitions of macros like @code{CHECK_@var{FOO}} and
5062 @code{@var{FOO}P} to @file{@var{foo}.h}
5063 @item
5064 add the new type index to @code{enum lrecord_type}
5065 @item
5066 add a @code{DEFINE_LRECORD_IMPLEMENTATION} call to @file{@var{foo}.c}
5067 @item
5068 add an @code{INIT_LRECORD_IMPLEMENTATION} call to @code{syms_of_@var{foo}}
5069 @end enumerate
5070
5071 @node Regression Testing XEmacs, CVS Techniques, Rules When Writing New C Code, Top
5072 @chapter Regression Testing XEmacs
5073 @cindex testing, regression
5074
5075 @menu
5076 * How to Regression-Test::
5077 * Modules for Regression Testing::
5078 @end menu
5079
5080 @node How to Regression-Test, Modules for Regression Testing, Regression Testing XEmacs, Regression Testing XEmacs
5081 @section How to Regression-Test
5082 @cindex how to regression-test
5083 @cindex regression-test, how to
5084 @cindex testing, regression, how to
5085
5086 The source directory @file{tests/automated} contains XEmacs' automated
5087 test suite. The usual way of running all the tests is running
5088 @code{make check} from the top-level build directory.
5089
5090 The test suite is unfinished and it's still lacking some essential
5091 features. It is nevertheless recommended that you run the tests to
5092 confirm that XEmacs behaves correctly.
5093
5094 If you want to run a specific test case, you can do it from the
5095 command-line like this:
5096
5097 @example
5098 $ xemacs -batch -l test-harness.elc -f batch-test-emacs TEST-FILE
5099 @end example
5100
5101 If a test fails and you need more information, you can run the test
5102 suite interactively by loading @file{test-harness.el} into a running
5103 XEmacs and typing @kbd{M-x test-emacs-test-file RET <filename> RET}.
5104 You will see a log of passed and failed tests, which should allow you to
5105 investigate the source of the error and ultimately fix the bug. If you
5106 are not capable of, or don't have time for, debugging it yourself,
5107 please do report the failures using @kbd{M-x report-emacs-bug} or
5108 @kbd{M-x build-report}.
5109
5110 @deffn Command test-emacs-test-file file
5111 Runs the tests in @var{file}. @file{test-harness.el} must be loaded.
5112 Defines all the macros described in this node, and undefines them when
5113 done.
5114 @end deffn
5115
5116 Adding a new test file is trivial: just create a new file in
5117 @file{tests/automated} and it will be run. There is no need to
5118 byte-compile any of the files in this directory---the test-harness
5119 will take care of any necessary byte-compilation.
5120
5121 Look at the existing test cases for examples of how to code test cases.
5122 It all boils down to your imagination and judicious use of the macros
5123 @code{Assert}, @code{Check-Error}, @code{Check-Error-Message}, and
5124 @code{Check-Message}. Note that all of these macros are defined only
5125 for the duration of the test: they do not exist in the global
5126 environment.
5127
5128 @deffn Macro Assert expr
5129 Check that @var{expr} is non-nil at this point in the test.
5130 @end deffn
5131
5132 @deffn Macro Check-Error expected-error body
5133 Check that execution of @var{body} causes @var{expected-error} to be
5134 signaled. @var{body} is a @code{progn}-like body, and may contain
5135 several expressions. @var{expected-error} is a symbol defined as
5136 an error by @code{define-error}.
5137 @end deffn
5138
5139 @deffn Macro Check-Error-Message expected-error expected-error-regexp body
5140 Check that execution of @var{body} causes @var{expected-error} to be
5141 signaled, and generates a message matching @var{expected-error-regexp}.
5142 @var{body} is a @code{progn}-like body, and may contain several
5143 expressions. @var{expected-error} is a symbol defined as an error
5144 by @code{define-error}.
5145 @end deffn
5146
5147 @deffn Macro Check-Message expected-message body
5148 Check that execution of @var{body} causes @var{expected-message} to be
5149 generated (using @code{message} or a similar function). @var{body} is a
5150 @code{progn}-like body, and may contain several expressions.
5151 @end deffn
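
As a minimal sketch of how these four macros are called, the forms below
are illustrative assumptions, not excerpts from the test suite. They
only work inside a file run by @file{test-harness.el}, since the macros
are not defined elsewhere, and the error and message strings are made up
for the example.

@example
;; Passes if the expression evaluates to non-nil.
(Assert (equal (append '(1 2) '(3)) '(1 2 3)))

;; Passes if the body signals the named error.
;; Note that the error symbol is not quoted.
(Check-Error wrong-type-argument (car 'foo))

;; As above, but the error message must also match the regexp.
;; The message text here is hypothetical.
(Check-Error-Message error "no such widget"
  (error "no such widget: %s" 'foo))

;; Passes if the body emits the expected message.
(Check-Message "hello" (message "hello"))
@end example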
5152
5153 Here's a simple example checking case-sensitive and case-insensitive
5154 comparisons from @file{case-tests.el}.
5155
5156 @example
5157 (with-temp-buffer
5158   (insert "Test Buffer")
5159   (let ((case-fold-search t))
5160     (goto-char (point-min))
5161     (Assert (eq (search-forward "test buffer" nil t) 12))
5162     (goto-char (point-min))
5163     (Assert (eq (search-forward "Test buffer" nil t) 12))
5164     (goto-char (point-min))
5165     (Assert (eq (search-forward "Test Buffer" nil t) 12))
5166
5167     (setq case-fold-search nil)
5168     (goto-char (point-min))
5169     (Assert (not (search-forward "test buffer" nil t)))
5170     (goto-char (point-min))
5171     (Assert (not (search-forward "Test buffer" nil t)))
5172     (goto-char (point-min))
5173     (Assert (eq (search-forward "Test Buffer" nil t) 12))))
5174 @end example
5175
5176 This example could be saved in a file in @file{tests/automated}, and it
5177 would constitute a complete test, automatically executed when you run
5178 @kbd{make check} after building XEmacs. More complex tests may require
5179 substantial temporary scaffolding to create the environment that elicits
5180 the bugs, but the top-level @file{Makefile} and @file{test-harness.el}
5181 handle the running and collection of results from the @code{Assert},
5182 @code{Check-Error}, @code{Check-Error-Message}, and @code{Check-Message}
5183 macros.
5184
5185 Don't suppress tests just because they're due to known bugs not yet
5186 fixed---use the @code{Known-Bug-Expect-Failure} wrapper macro to mark
5187 them.
5188
5189 @deffn Macro Known-Bug-Expect-Failure body
5190 Arrange for failing tests in @var{body} to generate messages prefixed
5191 with "KNOWN BUG:" instead of "FAIL:". @var{body} is a @code{progn}-like
5192 body, and may contain several tests.
5193 @end deffn
5194
5195 A lot of the tests we run push limits; suppress Ebola warning messages
5196 with the @code{Ignore-Ebola} wrapper macro.
5197
5198 @deffn Macro Ignore-Ebola body
5199 Suppress Ebola warning messages while running tests in @var{body}.
5200 @var{body} is a @code{progn}-like body, and may contain several tests.
5201 @end deffn
5202
5203 Both macros are defined temporarily within the test function. Simple
5204 examples:
5205
5206 @example
5207 ;; Apparently Ignore-Ebola is a solution with no problem to address.
5208 ;; There are no examples in 21.5, anyway.
5209
5210 ;; from regexp-tests.el
5211 (Known-Bug-Expect-Failure
5212 (Assert (not (string-match "\\b" "")))
5213 (Assert (not (string-match " \\b" " "))))
5214 @end example
5215
5216 In general, you should avoid using functionality from packages in your
5217 tests, because you can't be sure that everyone will have the required
5218 package. However, if you've got a test that works, by all means add it.
5219 Simply wrap the test in an appropriate feature check, add a notice that
5220 the test was skipped, and update the @code{skipped-test-reasons} hash
5221 table. The wrapper macro @code{Skip-Test-Unless} is provided to handle
5222 common cases.
5223
5224 @defvar skipped-test-reasons
5225 Hash table counting the number of times a particular reason is given for
5226 skipping tests. This is only defined within @code{test-emacs-test-file}.
5227 @end defvar
5228
5229 @deffn Macro Skip-Test-Unless prerequisite reason description body
5230 @var{prerequisite} is usually a feature test (@code{featurep},
5231 @code{boundp}, @code{fboundp}). @var{reason} is a string describing the
5232 prerequisite; it must be unique because it is used as a hash key in a
5233 table of reasons for skipping tests. @var{description} describes the
5234 tests being skipped, for the test result summary. @var{body} is a
5235 @code{progn}-like body, and may contain several tests.
5236 @end deffn
5237
5238 @code{Skip-Test-Unless} is defined temporarily within the test function.
5239 Here's an example of usage from @file{syntax-tests.el}:
5240
5241 @example
5242 ;; Test forward-comment at buffer boundaries
5243 (with-temp-buffer
5244   ;; try to use exactly what you need: featurep, boundp, fboundp
5245   (Skip-Test-Unless (fboundp 'c-mode)
5246                     "c-mode unavailable"
5247                     "comment and parse-partial-sexp tests"
5248     ;; and here's the test code
5249     (c-mode)
5250     (insert "// comment\n")
5251     (forward-comment -2)
5252     (Assert (eq (point) (point-min)))
5253     (let ((point (point)))
5254       (insert "/* comment */")
5255       (goto-char point)
5256       (forward-comment 2)
5257       (Assert (eq (point) (point-max)))
5258       (parse-partial-sexp point (point-max)))))
5259 @end example
5260
5261 @code{Skip-Test-Unless} is intended for use with features that are normally
5262 present in typical configurations. For truly optional features, or
5263 tests that apply to one of several alternative implementations (e.g., to
5264 GTK widgets, but not Athena, Motif, MS Windows, or Carbon), simply
5265 silently suppress the test if the feature is not available.
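
A minimal sketch of such silent suppression follows. It assumes that
the optional feature is Mule and that @code{find-charset} and
@code{charsetp} are available in a Mule build; substitute whatever
feature symbol and assertions your test actually needs.

@example
;; No skip notice and no entry in skipped-test-reasons:
;; the test simply does not run in builds without the feature.
(when (featurep 'mule)
  (Assert (charsetp (find-charset 'ascii))))
@end example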
5266
5267 Here are a few general hints for writing tests.
5268
5269 @enumerate
5270 @item
5271 Include related successful cases. Fixes often break something.
5272
5273 @item
5274 Use the @code{Known-Bug-Expect-Failure} macro to mark the cases you know
5275 are going to fail. We want to be able to distinguish between
5276 regressions and other unexpected failures, and cases that have
5277 been (partially) analyzed but not yet repaired.
5278
5279 @item
5280 Mark the bug with the date of report. An ``Unfixed since yyyy-mm-dd''
5281 gloss for @code{Known-Bug-Expect-Failure} is planned to further increase
5282 developer embarrassment (== incentive to fix the bug), but until then at
5283 least put a comment about the date so we can easily see when it was
5284 first reported.
5285
5286 @item
5287 It's a matter of your judgement, but you should often use generic tests
5288 (@emph{e.g.}, @code{eq}) instead of more specific tests (@code{=} for
5289 numbers) even though you know that arguments ``should'' be of correct
5290 type. This applies when the functions used can return a generic object
5291 (typically @code{nil}) on failure, as well as some more specific type on
5292 success. We don't want failures of those assertions
5293 reported as ``other failures'' (a wrong-type-arg signal, rather than a
5294 null return); we want them reported as ``assertion failures.''
5295
5296 One example is a test that tests @code{(= (string-match this that) 0)},
5297 expecting a successful match. Now suppose @code{string-match} is broken
5298 such that the match fails. Then it will return @code{nil}, and @code{=}
5299 will signal ``wrong-type-argument, number-char-or-marker-p, nil'',
5300 generating an ``other failure'' in the report. But this should be
5301 reported as an assertion failure (the test failed in a foreseeable way),
5302 rather than something else (we don't know what happened because XEmacs
5303 is broken in a way that we weren't trying to test!)
5304 @end enumerate
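
The last three hints can be sketched as follows. The strings being
matched and the date placeholder are illustrative assumptions; the
@code{Known-Bug-Expect-Failure} form reuses an assertion quoted earlier
from @file{regexp-tests.el}.

@example
;; Preferred: if the match fails, string-match returns nil, eq
;; returns nil, and the harness records an assertion failure.
(Assert (eq (string-match "foo" "foobar") 0))

;; Fragile: if the match fails, = signals wrong-type-argument on
;; nil, and the result is reported as an "other failure".
(Assert (= (string-match "foo" "foobar") 0))

;; Reported yyyy-mm-dd (placeholder: record the real report date).
(Known-Bug-Expect-Failure
 (Assert (not (string-match "\\b" ""))))
@end example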
5305
5306 @node Modules for Regression Testing, , How to Regression-Test, Regression Testing XEmacs
5307 @section Modules for Regression Testing
5308 @cindex modules for regression testing
5309 @cindex regression testing, modules for
5310
5311 @example
5312 @file{test-harness.el}
5313 @file{base64-tests.el}
5314 @file{byte-compiler-tests.el}
5315 @file{case-tests.el}
5316 @file{ccl-tests.el}
5317 @file{c-tests.el}
5318 @file{database-tests.el}
5319 @file{extent-tests.el}
5320 @file{hash-table-tests.el}
5321 @file{lisp-tests.el}
5322 @file{md5-tests.el}
5323 @file{mule-tests.el}
5324 @file{regexp-tests.el}
5325 @file{symbol-tests.el}
5326 @file{syntax-tests.el}
5327 @file{tag-tests.el}
5328 @file{weak-tests.el}
5329 @end example
5330
5331 @file{test-harness.el} defines the macros @code{Assert},
5332 @code{Check-Error}, @code{Check-Error-Message}, and
5333 @code{Check-Message}. The other files are test files, testing various
5334 XEmacs facilities. @xref{Regression Testing XEmacs}.
5335
5336
5337 @node CVS Techniques, XEmacs from the Inside, Regression Testing XEmacs, Top
5338 @chapter CVS Techniques
5339 @cindex CVS techniques
5340
5341 @menu
5342 * Merging a Branch into the Trunk::
5343 @end menu
5344
5345 @node Merging a Branch into the Trunk, , CVS Techniques, CVS Techniques
5346 @section Merging a Branch into the Trunk
5347 @cindex merging a branch into the trunk
5348
5349 @enumerate
5350 @item
5351 If you haven't already done a merge, you will be merging from the branch
5352 point; otherwise you'll be merging from the last merge point, which
5353 should be marked by a tag, e.g. @samp{last-sync-ben-mule-21-5}. In the
5354 former case, create the last-sync tag, e.g.
5355
5356 @example
5357 crw rtag -r ben-mule-21-5-bp last-sync-ben-mule-21-5 xemacs
5358 @end example
5359
5360 (You did create a branch point tag when you created the branch, didn't
5361 you?)
5362
5363 @item
5364 Check everything in on your branch.
5365
5366 @item
5367 Tag your branch with a pre-sync tag, e.g.
5368
5369 @example
5370 crw rtag -r ben-mule-21-5 ben-mule-21-5-pre-feb-20-2002-sync xemacs
5371 @end example
5372
5373 Note, you need to use rtag and specify a version with @samp{-r} (use
5374 @samp{-r HEAD} if necessary) so that removed files are handled correctly
5375 in some obscure cases. See section 4.8 of the CVS manual.
5376
5377 @item
5378 Tag the trunk so you have a stable place to merge up to in case people
5379 are asynchronously committing to the trunk, e.g.
5380
5381 @example
5382 crw rtag -r HEAD main-branch-ben-mule-21-5-syncpoint-feb-20-2002 xemacs
5383 crw rtag -F -r main-branch-ben-mule-21-5-syncpoint-feb-20-2002 next-sync-ben-mule-21-5 xemacs
5384 @end example
5385
5386 Use @samp{-F} in the second case because the name might already exist, e.g. if
5387 you've already done a merge. We make two tags because one is a
5388 permanent mark indicating a syncpoint when merging, and the other is a
5389 symbolic tag to make other operations easier.
5390
5391 @item
5392 Make a backup of your source tree (not totally necessary but useful for
5393 reference and peace of mind): Move one level up from the top directory
5394 of your branch and do, e.g.
5395
5396 @example
5397 cp -a mule mule-backup-2-23-02
5398 @end example
5399
5400 @item
5401 Now, we're ready to merge! Make sure you're in the top directory of
5402 your branch and do, e.g.
5403
5404 @example
5405 cvs update -j last-sync-ben-mule-21-5 -j next-sync-ben-mule-21-5
5406 @end example
5407
5408 @item
5409 Fix all merge conflicts. Get the sucker to compile and run.
5410
5411 @item
5412 Tag your branch with a post-sync tag, e.g.
5413
5414 @example
5415 crw rtag -r ben-mule-21-5 ben-mule-21-5-post-feb-20-2002-sync xemacs
5416 @end example
5417
5418 @item
5419 Update the last-sync tag, e.g.
5420
5421 @example
5422 crw rtag -F -r next-sync-ben-mule-21-5 last-sync-ben-mule-21-5 xemacs
5423 @end example
5424 @end enumerate
5425
5426
5427 @node XEmacs from the Inside, The XEmacs Object System (Abstractly Speaking), CVS Techniques, Top
5428 @chapter XEmacs from the Inside
5429 @cindex XEmacs from the inside
5430 @cindex inside, XEmacs from the
5431
5432 Internally, XEmacs is quite complex, and can be very confusing. To
5433 simplify things, it can be useful to think of XEmacs as containing an
5434 event loop that ``drives'' everything, and a number of other subsystems,
5435 such as a Lisp engine and a redisplay mechanism. Each of these other
5436 subsystems exists simultaneously in XEmacs, and each has a certain
5437 state. The flow of control continually passes in and out of these
5438 different subsystems in the course of normal operation of the editor.
5439
5440 It is important to keep in mind that, most of the time, the editor is
5441 ``driven'' by the event loop. Except during initialization and batch
5442 mode, all subsystems are entered directly or indirectly through the
5443 event loop, and ultimately, control exits out of all subsystems back up
5444 to the event loop. This cycle of entering a subsystem, exiting back out
5445 to the event loop, and starting another iteration of the event loop
5446 occurs once each keystroke, mouse motion, etc.
5447
5448 If you're trying to understand a particular subsystem (other than the
5449 event loop), think of it as a ``daemon'' process or ``servant'' that is
5450 responsible for one particular aspect of a larger system, and
5451 periodically receives commands or environment changes that cause it to
5452 do something. Ultimately, these commands and environment changes are
5453 always triggered by the event loop. For example:
5454
5455 @itemize @bullet
5456 @item
5457 The window and frame mechanism is responsible for keeping track of what
5458 windows and frames exist, what buffers are in them, etc. It is
5459 periodically given commands (usually from the user) to make a change to
5460 the current window/frame state: e.g. create a new frame, delete a
5461 window, etc.
5462
5463 @item
5464 The buffer mechanism is responsible for keeping track of what buffers
5465 exist and what text is in them. It is periodically given commands
5466 (usually from the user) to insert or delete text, create a buffer, etc.
5467 When it receives a text-change command, it notifies the redisplay
5468 mechanism.
5469
5470 @item
5471 The redisplay mechanism is responsible for making sure that windows and
5472 frames are displayed correctly. It is periodically told (by the event
5473 loop) to actually ``do its job'', i.e. snoop around and see what the
5474 current state of the environment (mostly of the currently-existing
5475 windows, frames, and buffers) is, and make sure that state matches
5476 what's actually displayed. It keeps lots and lots of information around
5477 (such as what is actually being displayed currently, and what the
5478 environment was last time it checked) so that it can minimize the work
5479 it has to do. It is also helped along in that whenever a relevant
5480 change to the environment occurs, the redisplay mechanism is told about
5481 this, so it has a pretty good idea of where it has to look to find
5482 possible changes and doesn't have to look everywhere.
5483
5484 @item
5485 The Lisp engine is responsible for executing the Lisp code in which most
5486 user commands are written. It is entered through a call to @code{eval}
5487 or @code{funcall}, which occurs as a result of dispatching an event from
5488 the event loop. The functions it calls issue commands to the buffer
5489 mechanism, the window/frame subsystem, etc.
5490
5491 @item
5492 The Lisp allocation subsystem is responsible for keeping track of Lisp
5493 objects. It is given commands from the Lisp engine to allocate objects,
5494 garbage collect, etc.
5495 @end itemize
5496
5497 The other subsystems work along the same lines.
5498
5499 The important idea here is that there are a number of independent
5500 subsystems each with its own responsibility and persistent state, just
5501 like different employees in a company, and each subsystem is
5502 periodically given commands from other subsystems. Commands can flow
5503 from any one subsystem to any other, but there is usually some sort of
5504 hierarchy, with all commands originating from the event subsystem.
5505
5506 XEmacs is entered in @code{main()}, which is in @file{emacs.c}. When
5507 this is called the first time (in a properly-invoked @file{temacs}), it
5508 does the following:
5509
5510 @enumerate
5511 @item
5512 It does some very basic environment initializations, such as determining
5513 where it and its directories (e.g. @file{lisp/} and @file{etc/}) reside
5514 and setting up signal handlers.
5515 @item
5516 It initializes the entire Lisp interpreter.
5517 @item
5518 It sets the initial values of many built-in variables (including many
5519 variables that are visible to Lisp programs), such as the global keymap
5520 object and the built-in faces (a face is an object that describes the
5521 display characteristics of text). This involves creating Lisp objects
5522 and thus is dependent on step (2).
5523 @item
5524 It performs various other initializations that are relevant to the
5525 particular environment it is running in, such as retrieving environment
5526 variables, determining the current date and the user who is running the
5527 program, examining its standard input, creating any necessary file
5528 descriptors, etc.
5529 @item
5530 At this point, the C initialization is complete. A Lisp program that
5531 was specified on the command line (usually @file{loadup.el}) is called
5532 (temacs is normally invoked as @code{temacs -batch -l loadup.el dump}).
5533 @file{loadup.el} loads all of the other Lisp files that are needed for
5534 the operation of the editor, calls the @code{dump-emacs} function to
5535 write out @file{xemacs}, and then kills the temacs process.
5536 @end enumerate
5537
5538 When @file{xemacs} is then run, it only redoes steps (1) and (4)
5539 above; all variables already contain the values they were set to when
5540 the executable was dumped, and all memory that was allocated with
5541 @code{malloc()} is still around. (XEmacs knows whether it is being run
5542 as @file{xemacs} or @file{temacs} because it sets the global variable
5543 @code{initialized} to 1 after step (4) above.) At this point,
5544 @file{xemacs} calls a Lisp function to do any further initialization,
5545 which includes parsing the command-line (the C code can only do limited
5546 command-line parsing, which includes looking for the @samp{-batch} and
5547 @samp{-l} flags and a few other flags that it needs to know about before
5548 initialization is complete), creating the first frame (or @dfn{window}
5549 in standard window-system parlance), running the user's init file
5550 (usually the file @file{.emacs} in the user's home directory), etc. The
5551 function to do this is usually called @code{normal-top-level};
5552 @file{loadup.el} tells the C code about this function by setting its
5553 name as the value of the Lisp variable @code{top-level}.
5554
5555 When the Lisp initialization code is done, the C code enters the event
5556 loop, and stays there for the duration of the XEmacs process. The code
5557 for the event loop is contained in @file{cmdloop.c}, and is called
5558 @code{Fcommand_loop_1()}. Note that this event loop could very well be
5559 written in Lisp, and in fact a Lisp version exists; but apparently,
5560 doing this makes XEmacs run noticeably slower.
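
To give a feel for what an event loop written in Lisp looks like, here
is a deliberately minimal sketch built on the Lisp primitives
@code{next-event} and @code{dispatch-event}. It is not the Lisp version
referred to above, the function name is hypothetical, and it omits
everything the real loop must handle, such as errors, quitting, and
keyboard macros.

@example
;; Hypothetical illustration only; the real loop is in cmdloop.c.
(defun my-toy-command-loop ()
  (while t
    ;; Block until something happens, then let XEmacs act on it.
    (dispatch-event (next-event))))
@end example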
5561
5562 Notice how much of the initialization is done in Lisp, not in C.
5563 In general, XEmacs tries to move as much code as is possible
5564 into Lisp. Code that remains in C is code that implements the
5565 Lisp interpreter itself, or code that needs to be very fast, or
5566 code that needs to do system calls or other such stuff that
5567 needs to be done in C, or code that needs to have access to
5568 ``forbidden'' structures. (One conscious aspect of the design of
5569 Lisp under XEmacs is a clean separation between the external
5570 interface to a Lisp object's functionality and its internal
5571 implementation. Part of this design is that Lisp programs
5572 are forbidden from accessing the contents of the object other
5573 than through using a standard API. In this respect, XEmacs Lisp
5574 is similar to modern Lisp dialects but differs from GNU Emacs,
5575 which tends to expose the implementation and allow Lisp
5576 programs to look at it directly. The major advantage of
5577 hiding the implementation is that it allows the implementation
5578 to be redesigned without affecting any Lisp programs, including
5579 those that might want to be ``clever'' by looking directly at
5580 the object's contents and possibly manipulating them.)
5581
5582 Moving code into Lisp makes the code easier to debug and maintain and
5583 makes it much easier for people who are not XEmacs developers to
5584 customize XEmacs, because they can make a change with much less chance
5585 of obscure and unwanted interactions occurring than if they were to
5586 change the C code.
5587
5588 @node The XEmacs Object System (Abstractly Speaking), How Lisp Objects Are Represented in C, XEmacs from the Inside, Top
5589 @chapter The XEmacs Object System (Abstractly Speaking)
5590 @cindex XEmacs object system (abstractly speaking), the
5591 @cindex object system (abstractly speaking), the XEmacs
5592
5593 At the heart of the Lisp interpreter is its management of objects.
5594 XEmacs Lisp contains many built-in objects, some of which are
5595 simple and others of which can be very complex; and some of which
5596 are very common, and others of which are rarely used or are only
5597 used internally. (Since the Lisp allocation system, with its
5598 automatic reclamation of unused storage, is so much more convenient
5599 than @code{malloc()} and @code{free()}, the C code makes extensive use of it
5600 in its internal operations.)
5601
5602 The basic Lisp objects are
5603
5604 @table @code
5605 @item integer
5606 31 bits of precision, or 63 bits on 64-bit machines; the
5607 reason for this is described below when the internal Lisp object
5608 representation is described.
5609 @item char
5610 An object representing a single character of text; chars behave like
5611 integers in many ways but are logically considered text rather than
5612 numbers and have a different read syntax. (The read syntax for a char
5613 contains the char itself or some textual encoding of it---for example,
5614 a Japanese Kanji character might be encoded as @samp{^[$(B#&^[(B} using the
5615 ISO-2022 encoding standard---rather than the numerical representation
5616 of the char; this way, if the mapping between chars and integers
5617 changes, which is quite possible for Kanji characters and other extended
5618 characters, the same character will still be created. Note that some
5619 primitives confuse chars and integers. The worst culprit is @code{eq},
5620 which makes a special exception and considers a char to be @code{eq} to
5621 its integer equivalent, even though in no other case are objects of two
5622 different types @code{eq}. The reason for this monstrosity is
5623 compatibility with existing code; the separation of char from integer
5624 came fairly recently.)
5625 @item float
5626 Same precision as a double in C.
5627 @item bignum
5628 @itemx ratio
5629 @itemx bigfloat
5630 As build-time options, arbitrary-precision numbers are available.
5631 Bignums are integers, and when available they remove the restriction on
5632 buffer size. Ratios are non-integral rational numbers. Bigfloats are
5633 arbitrary-precision floating point numbers, with precision specified at
5634 runtime.
5635 @item symbol
5636 An object that contains Lisp objects and is referred to by name;
5637 symbols are used to implement variables and named functions
5638 and to provide the equivalent of preprocessor constants in C.
5639 @item string
5640 Self-explanatory; behaves much like a vector of chars
5641 but has a different read syntax and is stored and manipulated
5642 more compactly.
5643 @item bit-vector
5644 A vector of bits; similar to a string in spirit.
5645 @item vector
5646 A one-dimensional array of Lisp objects providing constant-time access
5647 to any of the objects; access to an arbitrary object in a vector is
5648 faster than for lists, but the operations that can be done on a vector
5649 are more limited.
5650 @item compiled-function
5651 An object containing compiled Lisp code, known as @dfn{byte code}.
5652 @item subr
5653 A Lisp primitive, i.e. a Lisp-callable function implemented in C.
5654 @item cons
5655 A simple container for two Lisp objects, used to implement lists and
5656 most other data structures in Lisp.
5657 @end table
5658
5659 Objects which are not conses are called atoms.
5660
5661 @cindex closure
5662 Note that there is no basic ``function'' type, as in more powerful
5663 versions of Lisp (where it's called a @dfn{closure}). XEmacs Lisp does
5664 not provide the closure semantics implemented by Common Lisp and Scheme.
5665 The guts of a function in XEmacs Lisp are represented in one of four
5666 ways: a symbol specifying another function (when one function is an
5667 alias for another), a list (whose first element must be the symbol
5668 @code{lambda}) containing the function's source code, a
5669 compiled-function object, or a subr object. (In other words, given a
5670 symbol specifying the name of a function, calling @code{symbol-function}
5671 to retrieve the contents of the symbol's function cell will return one
5672 of these types of objects.)
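
The four representations can be inspected with
@code{symbol-function}. The sketch below uses hypothetical function
names (@code{my-example-double} and @code{my-example-alias}) and assumes
the definitions are evaluated interpreted before the explicit call to
@code{byte-compile}.

@example
;; A subr: car is implemented in C.
(subrp (symbol-function 'car))

;; A list starting with lambda: an interpreted definition.
(defun my-example-double (x) (* 2 x))
(symbol-function 'my-example-double)

;; A symbol: an alias simply names another function.
(defalias 'my-example-alias 'my-example-double)
(symbol-function 'my-example-alias)

;; A compiled-function object, once the definition is byte-compiled.
(byte-compile 'my-example-double)
(compiled-function-p (symbol-function 'my-example-double))
@end example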
5673
5674 XEmacs Lisp also contains numerous specialized objects used to implement
5675 the editor:
5676
5677 @table @code
5678 @item buffer
5679 Stores text like a string, but is optimized for insertion and deletion
5680 and has certain other properties that can be set.
5681 @item frame
5682 An object with various properties whose displayable representation is a
5683 @dfn{window} in window-system parlance.
5684 @item window
5685 A section of a frame that displays the contents of a buffer;
5686 often called a @dfn{pane} in window-system parlance.
5687 @item window-configuration
5688 An object that represents a saved configuration of windows in a frame.
5689 @item device
5690 An object representing a screen on which frames can be displayed;
5691 equivalent to a @dfn{display} in the X Window System and a @dfn{TTY} in
5692 character mode.
5693 @item face
5694 An object specifying the appearance of text or graphics; it has
5695 properties such as font, foreground color, and background color.
5696 @item marker
5697 An object that refers to a particular position in a buffer and moves
5698 around as text is inserted and deleted to stay in the same relative
5699 position to the text around it.
5700 @item extent
5701 Similar to a marker but covers a range of text in a buffer; can also
5702 specify properties of the text, such as a face in which the text is to
5703 be displayed, whether the text is invisible or unmodifiable, etc.
5704 @item event
5705 Generated by calling @code{next-event} and contains information
5706 describing a particular event happening in the system, such as the user
5707 pressing a key or a process terminating.
5708 @item keymap
5709 An object that maps from events (described using lists, vectors, and
5710 symbols rather than with an event object because the mapping is for
5711 classes of events, rather than individual events) to functions to
5712 execute or other events to recursively look up; the functions are
5713 described by name, using a symbol, or using lists to specify the
5714 function's code.
5715 @item glyph
5716 An object that describes the appearance of an image (e.g. pixmap) on
5717 the screen; glyphs can be attached to the beginning or end of extents
5718 and in some future version of XEmacs will be able to be inserted
5719 directly into a buffer.
5720 @item process
5721 An object that describes a connection to an externally-running process.
5722 @end table
5723
5724 There are some other, less-commonly-encountered general objects:
5725
5726 @table @code
5727 @item hash-table
5728 An object that maps from an arbitrary Lisp object to another arbitrary
5729 Lisp object, using hashing for fast lookup.
5730 @item obarray
5731 A limited form of hash-table that maps from strings to symbols; obarrays
5732 are used to look up a symbol given its name and are not actually their
5733 own object type but are kludgily represented using vectors with hidden
5734 fields (this representation derives from GNU Emacs).
5735 @item specifier
5736 A complex object used to specify the value of a display property; a
5737 default value is given and different values can be specified for
5738 particular frames, buffers, windows, devices, or classes of device.
5739 @item char-table
5740 An object that maps from chars or classes of chars to arbitrary Lisp
5741 objects; internally char tables use a complex nested-vector
5742 representation that is optimized to the way characters are represented
5743 as integers.
5744 @item range-table
5745 An object that maps from ranges of integers to arbitrary Lisp objects.
5746 @end table
5747
5748 And some strange special-purpose objects:
5749
5750 @table @code
5751 @item charset
5752 @itemx coding-system
5753 Objects used when MULE, or multi-lingual/Asian-language, support is
5754 enabled.
5755 @item color-instance
5756 @itemx font-instance
5757 @itemx image-instance
5758 An object that encapsulates a window-system resource; instances are
5759 mostly used internally but are exposed on the Lisp level for cleanness
5760 of the specifier model and because it's occasionally useful for Lisp
5761 programs to create or query the properties of instances.
5762 @item subwindow
5763 An object that encapsulates a @dfn{subwindow} resource, i.e. a
5764 window-system child window that is drawn into by an external process;
5765 this object should be integrated into the glyph system but isn't yet,
5766 and may change form when this is done.
5767 @item tooltalk-message
5768 @itemx tooltalk-pattern
5769 Objects that represent resources used in the ToolTalk interprocess
5770 communication protocol.
5771 @item toolbar-button
5772 An object used in conjunction with the toolbar.
5773 @end table
5774
5775 And objects that are only used internally:
5776
5777 @table @code
5778 @item opaque
5779 A generic object for encapsulating arbitrary memory; this allows you the
5780 generality of @code{malloc()} and the convenience of the Lisp object
5781 system.
5782 @item lstream
5783 A buffering I/O stream, used to provide a unified interface to anything
5784 that can accept output or provide input, such as a file descriptor, a
5785 stdio stream, a chunk of memory, a Lisp buffer, a Lisp string, etc.;
5786 it's a Lisp object to make its memory management more convenient.
5787 @item char-table-entry
5788 Subsidiary objects in the internal char-table representation.
5789 @item extent-auxiliary
5790 @itemx menubar-data
5791 @itemx toolbar-data
5792 Various special-purpose objects that are basically just used to
5793 encapsulate memory for particular subsystems, similar to the more
5794 general ``opaque'' object.
5795 @item symbol-value-forward
5796 @itemx symbol-value-buffer-local
5797 @itemx symbol-value-varalias
5798 @itemx symbol-value-lisp-magic
5799 Special internal-only objects that are placed in the value cell of a
5800 symbol to indicate that there is something special with this variable --
5801 e.g. it has no value, it mirrors another variable, or it mirrors some C
5802 variable; there is really only one kind of object, called a
5803 @dfn{symbol-value-magic}, but it is sort-of halfway kludged into
5804 semi-different object types.
5805 @end table
5806
5807 @cindex permanent objects
5808 @cindex temporary objects
5809 Some types of objects are @dfn{permanent}, meaning that once created,
5810 they do not disappear until explicitly destroyed, using a function such
5811 as @code{delete-buffer}, @code{delete-window}, @code{delete-frame}, etc.
5812 Others will disappear once they are no longer used, through the garbage
5813 collection mechanism. Buffers, frames, windows, devices, and processes
5814 are among the objects that are permanent. Note that some objects can go
5815 both ways: Faces can be created either way; extents are normally
5816 permanent, but detached extents (extents not referring to any text, as
5817 happens to some extents when the text they are referring to is deleted)
5818 are temporary. Note that some permanent objects, such as faces and
5819 coding systems, cannot be deleted. Note also that windows are unique in
5820 that they can be @emph{undeleted} after having previously been
5821 deleted. (This happens as a result of restoring a window configuration.)
5822
5823 @cindex read syntax
5824 Many types of objects have a @dfn{read syntax}, i.e. a way of
5825 specifying an object of that type in Lisp code. When you load a Lisp
5826 file, or type in code to be evaluated, what really happens is that the
5827 function @code{read} is called, which reads some text and creates an object
5828 based on the syntax of that text; then @code{eval} is called, which
5829 possibly does something special; then this loop repeats until there's
5830 no more text to read. (@code{eval} only actually does something special
5831 with symbols, which causes the symbol's value to be returned,
5832 similar to referencing a variable; and with conses [i.e. lists],
5833 which cause a function invocation. All other values are returned
5834 unchanged.)
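The dispatch performed by @code{eval} can be sketched in C. This is a toy model under stated assumptions, not the XEmacs implementation: every name beginning with @code{sketch_} is hypothetical, the function cell holds a bare C function pointer, and the called function receives its argument list unevaluated.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_type { SKETCH_INT, SKETCH_SYMBOL, SKETCH_CONS };

struct sketch_object;
typedef struct sketch_object *(*sketch_subr)(struct sketch_object *args);

struct sketch_object
{
  enum sketch_type type;
  long number;                      /* SKETCH_INT */
  struct sketch_object *value;      /* SKETCH_SYMBOL: value cell */
  sketch_subr function;             /* SKETCH_SYMBOL: function cell */
  struct sketch_object *car, *cdr;  /* SKETCH_CONS; a NULL cdr ends a list */
};

static struct sketch_object *sketch_eval(struct sketch_object *form);

/* Sample function for the toy model: evaluate and return the first argument. */
static struct sketch_object *sketch_first_arg(struct sketch_object *args)
{
  return sketch_eval(args->car);
}

static struct sketch_object *sketch_eval(struct sketch_object *form)
{
  switch (form->type)
    {
    case SKETCH_SYMBOL:
      return form->value;                     /* like referencing a variable */
    case SKETCH_CONS:
      return form->car->function(form->cdr);  /* function invocation */
    default:
      return form;                            /* all other values are unchanged */
    }
}
```

Evaluating an integer object returns the same object, evaluating a symbol returns the contents of its value cell, and evaluating a list whose car is a symbol calls whatever that symbol's function cell points to.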
5835
5836 The read syntax
5837
5838 @example
5839 17297
5840 @end example
5841
5842 converts to an integer whose value is 17297.
5843
5844 @example
5845 355/113
5846 @end example
5847
5848 converts to a ratio (this particular one is commonly used to approximate
5849 @emph{pi}) when ratios are configured, and otherwise to a symbol whose name
5850 is ``355/113'' (for backward compatibility).
5851
5852 @example
5853 1.983e-4
5854 @end example
5855
5856 converts to a float whose value is 1.983e-4, or .0001983.
5857
5858 @example
5859 ?b
5860 @end example
5861
5862 converts to a char that represents the lowercase letter b.
5863
5864 @example
5865 ?^[$(B#&^[(B
5866 @end example
5867
5868 (where @samp{^[} actually is an @samp{ESC} character) converts to a
5869 particular Kanji character when using an ISO2022-based coding system for
5870 input. (To decode this goo: @samp{ESC} begins an escape sequence;
5871 @samp{ESC $ (} is a class of escape sequences meaning ``switch to a
5872 94x94 character set''; @samp{ESC $ ( B} means ``switch to Japanese
5873 Kanji''; @samp{#} and @samp{&} collectively index into a 94-by-94 array
5874 of characters [subtract 33 from the ASCII value of each character to get
5875 the corresponding index]; @samp{ESC (} is a class of escape sequences
5876 meaning ``switch to a 94-character set''; @samp{ESC ( B} means ``switch
5877 to US ASCII''. It is a coincidence that the letter @samp{B} is used to
5878 denote both Japanese Kanji and US ASCII. If the first @samp{B} were
5879 replaced with an @samp{A}, you'd be requesting a Chinese Hanzi character
5880 from the GB2312 character set.)
5881
5882 @example
5883 "foobar"
5884 @end example
5885
5886 converts to a string.
5887
5888 @example
5889 foobar
5890 @end example
5891
5892 converts to a symbol whose name is @code{"foobar"}. This is done by
5893 looking up the string equivalent in the global variable
5894 @code{obarray}, whose contents should be an obarray. If no symbol
5895 is found, a new symbol with the name @code{"foobar"} is automatically
5896 created and added to @code{obarray}; this process is called
5897 @dfn{interning} the symbol.
5898 @cindex interning
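Interning can be illustrated with a toy obarray in C. This is only a sketch: the bucket count, the hash function, and all @code{sketch_} names are assumptions made for illustration, and the real obarray representation (a vector with hidden fields) differs in detail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy obarray with hypothetical names: a fixed number of hash buckets,
   each a linked list of symbols. */
#define SKETCH_BUCKETS 31

struct sketch_symbol
{
  char *name;
  struct sketch_symbol *next;   /* next symbol in the same bucket */
};

static struct sketch_symbol *sketch_obarray[SKETCH_BUCKETS];

static unsigned sketch_hash(const char *s)
{
  unsigned h = 0;
  while (*s)
    h = h * 31 + (unsigned char) *s++;
  return h % SKETCH_BUCKETS;
}

/* Return the symbol named NAME, creating and recording it on first use. */
static struct sketch_symbol *sketch_intern(const char *name)
{
  unsigned h = sketch_hash(name);
  struct sketch_symbol *sym;

  for (sym = sketch_obarray[h]; sym; sym = sym->next)
    if (strcmp(sym->name, name) == 0)
      return sym;                       /* already interned */

  sym = malloc(sizeof *sym);
  if (!sym)
    abort();
  sym->name = malloc(strlen(name) + 1);
  if (!sym->name)
    abort();
  strcpy(sym->name, name);
  sym->next = sketch_obarray[h];        /* push onto the bucket's list */
  sketch_obarray[h] = sym;
  return sym;
}
```

Because a lookup always precedes creation, two reads of the same name yield the identical symbol object, which is what lets symbols be compared by address.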
5899
5900 @example
5901 (foo . bar)
5902 @end example
5903
5904 converts to a cons cell containing the symbols @code{foo} and @code{bar}.
5905
5906 @example
5907 (1 a 2.5)
5908 @end example
5909
5910 converts to a three-element list containing the specified objects
5911 (note that a list is actually a set of nested conses; see the
5912 XEmacs Lisp Reference).
5913
5914 @example
5915 [1 a 2.5]
5916 @end example
5917
5918 converts to a three-element vector containing the specified objects.
5919
5920 @example
5921 #[... ... ... ...]
5922 @end example
5923
5924 converts to a compiled-function object (the actual contents are not
5925 shown since they are not relevant here; look at a file that ends with
5926 @file{.elc} for examples).
5927
5928 @example
5929 #*01110110
5930 @end example
5931
5932 converts to a bit-vector.
5933
5934 @example
5935 #s(hash-table ... ...)
5936 @end example
5937
5938 converts to a hash table (the actual contents are not shown).
5939
5940 @example
5941 #s(range-table ... ...)
5942 @end example
5943
5944 converts to a range table (the actual contents are not shown).
5945
5946 @example
5947 #s(char-table ... ...)
5948 @end example
5949
5950 converts to a char table (the actual contents are not shown).
5951
5952 Note that the @code{#s()} syntax is the general syntax for structures,
5953 which are not really implemented in XEmacs Lisp but should be.
5954
5955 When an object is printed out (using @code{print} or a related
5956 function), the read syntax is used, so that the same object can be read
5957 in again.
5958
5959 The other objects do not have read syntaxes, usually because it does not
5960 really make sense to create them in this fashion (e.g. processes, where
5961 it doesn't make sense to have a subprocess created as a side effect of
5962 reading some Lisp code), or because they can't be created at all
5963 (e.g. subrs). Permanent objects, as a rule, do not have a read syntax;
5964 nor do most complex objects, which contain too much state to be easily
5965 initialized through a read syntax.
5966
5967 @node How Lisp Objects Are Represented in C, Allocation of Objects in XEmacs Lisp, The XEmacs Object System (Abstractly Speaking), Top
5968 @chapter How Lisp Objects Are Represented in C
5969 @cindex Lisp objects are represented in C, how
5970 @cindex objects are represented in C, how Lisp
5971 @cindex represented in C, how Lisp objects are
5972
5973 Lisp objects are represented in C using a 32-bit or 64-bit machine word
5974 (depending on the processor; e.g. DEC Alphas use 64-bit Lisp objects and
5975 most other processors use 32-bit Lisp objects). The representation
5976 stuffs a pointer together with a tag, as follows:
5977
5978 @example
5979 [ 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ]
5980 [ 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ]
5981
5982   <---------------------------------------------------------> <->
5983             a pointer to a structure, or an integer           tag
5984 @end example
5985
5986 A tag of 00 is used for all pointer object types, a tag of 10 is used
5987 for characters, and the other two tags 01 and 11 are joined together to
5988 form the integer object type. This representation gives us 31-bit
5989 integers and 30-bit characters, while pointers are represented directly
5990 without any bit masking or shifting. This representation, though,
5991 assumes that pointers to structs are always aligned to multiples of 4,
5992 so the lower 2 bits are always zero.
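The tagging scheme just described can be sketched in C as follows. All names here are hypothetical stand-ins rather than the real XEmacs macros, and the sketch assumes that @code{>>} on a negative signed value is an arithmetic shift, which C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the XEmacs source. */
typedef intptr_t sketch_obj;   /* one machine word */

/* Low two bits: 00 = pointer, 10 = character, 01 and 11 = integer. */
static inline int sketch_is_pointer(sketch_obj o) { return (o & 3) == 0; }
static inline int sketch_is_char(sketch_obj o)    { return (o & 3) == 2; }
static inline int sketch_is_int(sketch_obj o)     { return (o & 1) == 1; }

/* Integers keep all but one bit of the word above a single 1 tag bit.
   Multiply rather than shift left so negative values stay well defined. */
static inline sketch_obj sketch_make_int(intptr_t n) { return n * 2 + 1; }

/* Relies on >> being an arithmetic shift, which sign-extends negatives. */
static inline intptr_t sketch_int_value(sketch_obj o) { return o >> 1; }

/* Characters sit above the two-bit tag 10. */
static inline sketch_obj sketch_make_char(intptr_t c) { return c * 4 + 2; }
static inline intptr_t sketch_char_value(sketch_obj o) { return o >> 2; }

/* Pointers are stored as-is; 4-byte alignment supplies the 00 tag. */
static inline sketch_obj sketch_make_pointer(void *p) { return (sketch_obj) p; }
static inline void *sketch_pointer_value(sketch_obj o) { return (void *) o; }
```

Since both 01 and 11 denote an integer, testing the lowest bit alone is enough to recognize one, and a single right shift recovers the value.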
5993
5994 Lisp objects use the typedef @code{Lisp_Object}, but the actual C type
5995 used for the Lisp object can vary. It can be either a simple type
5996 (@code{long} on the DEC Alpha, @code{int} on other machines) or a
5997 structure whose fields are bit fields that line up properly (actually, a
5998 union of structures is used). Generally the simple integral type is
5999 preferable because it ensures that the compiler will actually use a
6000 machine word to represent the object (some compilers will use more
6001 general and less efficient code for unions and structs even if they can
6002 fit in a machine word). The union type, however, has the advantage of
6003 stricter type checking. If you accidentally pass an integer where a Lisp
6004 object is desired, you get a compile error. The choice of which type
6005 to use is determined by the preprocessor constant @code{USE_UNION_TYPE}
6006 which is defined via the @code{--use-union-type} option to
6007 @code{configure}.
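The type-checking benefit of a wrapped representation can be seen in a small C sketch. It is a simplification: a single-member struct stands in for the union of bit-field structures, and all @code{sketch_} names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two possible definitions of Lisp_Object. */
typedef intptr_t sketch_plain_obj;                    /* simple integral type */
typedef struct { intptr_t bits; } sketch_strict_obj;  /* wrapped type */

static inline int sketch_plain_is_int(sketch_plain_obj o)
{
  return (o & 1) == 1;
}

static inline int sketch_strict_is_int(sketch_strict_obj o)
{
  return (o.bits & 1) == 1;
}

static inline sketch_strict_obj sketch_strict_make_int(intptr_t n)
{
  sketch_strict_obj o = { n * 2 + 1 };   /* tag the value as an integer */
  return o;
}

/* sketch_plain_is_int (42) compiles silently, although 42 was never tagged.
   sketch_strict_is_int (42) is rejected by the compiler, because an int
   cannot be converted to a struct. */
```

With the plain definition, an untagged C integer passes through unnoticed and is simply misinterpreted at run time; with the wrapped definition the same mistake stops the build.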
6008
6009 Various macros are used to convert between Lisp_Objects and the
6010 corresponding C type. Macros of the form @code{XINT()}, @code{XCHAR()},
6011 @code{XSTRING()}, and @code{XSYMBOL()} do any required bit shifting and/or
6012 masking and cast the result to the appropriate type. @code{XINT()} needs to be
6013 a bit tricky so that negative numbers are properly sign-extended. Since
6014 integers are stored left-shifted, if the right-shift operator does an
6015 arithmetic shift (i.e. it leaves the most-significant bit as-is rather
6016 than shifting in a zero, so that it mimics a divide-by-two even for
6017 negative numbers) the shift to remove the tag bit is enough. This is
6018 the case on all the systems we support.
6019
6020 Note that when @code{ERROR_CHECK_TYPECHECK} is defined, the converter
6021 macros become more complicated---they check the tag bits and/or the
6022 type field in the first four bytes of a record type to ensure that the
6023 object is really of the correct type. This is great for catching places
6024 where an incorrect type is being dereferenced---this typically results
6025 in a pointer being dereferenced as the wrong type of structure, with
6026 unpredictable (and sometimes not easily traceable) results.
6027
6028 There are similar @code{XSET@var{TYPE}()} macros that construct a Lisp
6029 object. These macros are of the form @code{XSET@var{TYPE}
6030 (@var{lvalue}, @var{result})}, i.e. they have to be a statement rather
6031 than just used in an expression. The reason for this is that standard C
6032 doesn't let you ``construct'' a structure (but GCC does). Granted, this
6033 sometimes isn't too convenient; for the case of integers, at least, you
6034 can use the function @code{make_int()}, which constructs and
6035 @emph{returns} an integer Lisp object. Note that the
6036 @code{XSET@var{TYPE}()} macros are also affected by
6037 @code{ERROR_CHECK_TYPECHECK} and make sure that the structure is of the
6038 right type in the case of record types, where the type is contained in
6039 the structure.
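The difference between the statement form and the function form can be sketched in C. The macro and function below are hypothetical analogues of @code{XSET@var{TYPE}()} and @code{make_int()}, written for a struct-wrapped object, and are not the real definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { intptr_t bits; } sketch_obj;   /* hypothetical wrapped object */

/* Statement form: assigns into an lvalue, so it cannot appear inside an
   expression. */
#define SKETCH_XSETINT(lvalue, n) \
  do { (lvalue).bits = (intptr_t) (n) * 2 + 1; } while (0)

/* Function form: builds the object and returns it, so it can be nested
   inside other calls. */
static inline sketch_obj sketch_make_int(intptr_t n)
{
  sketch_obj o;
  SKETCH_XSETINT(o, n);
  return o;
}

/* Assumes >> on a negative signed value is an arithmetic shift. */
static inline intptr_t sketch_xint(sketch_obj o) { return o.bits >> 1; }
```

A caller of the statement form needs a variable to assign into first, whereas the function form can be passed directly as an argument, for example @code{sketch_xint (sketch_make_int (-3))}.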
6040
6041 The C programmer is responsible for @strong{guaranteeing} that a
6042 Lisp_Object is the correct type before using the @code{X@var{TYPE}}
6043 macros. This is especially important in the case of lists. Use
6044 @code{XCAR} and @code{XCDR} if a Lisp_Object is certainly a cons cell,
6045 else use @code{Fcar()} and @code{Fcdr()}. Trust other C code, but not
6046 Lisp code. On the other hand, if XEmacs has an internal logic error,
6047 it's better to crash immediately, so sprinkle @code{assert()}s and
6048 ``unreachable'' @code{abort()}s liberally about the source code. Where
6049 performance is an issue, use @code{type_checking_assert},
6050 @code{bufpos_checking_assert}, and @code{gc_checking_assert}, which do
6051 nothing unless the corresponding configure error checking flag was
6052 specified.
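The contrast between unchecked and checked accessors can be sketched with a toy object model in C. The names are hypothetical, and returning @code{NULL} stands in for signalling a Lisp error, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* A toy object model with hypothetical names. */
enum sketch_type { SKETCH_NIL, SKETCH_INT, SKETCH_CONS };

struct sketch_object
{
  enum sketch_type type;
  struct sketch_object *car, *cdr;   /* meaningful only for SKETCH_CONS */
};

static struct sketch_object sketch_nil = { SKETCH_NIL, NULL, NULL };

/* Unchecked, in the spirit of XCAR(): the caller guarantees a cons.
   The assert documents that guarantee and traps internal logic errors. */
static inline struct sketch_object *sketch_xcar(struct sketch_object *o)
{
  assert(o->type == SKETCH_CONS);
  return o->car;
}

/* Checked, in the spirit of Fcar(): safe on values that came from Lisp.
   Returns NULL where the real function would signal a Lisp error. */
static inline struct sketch_object *sketch_fcar(struct sketch_object *o)
{
  if (o->type == SKETCH_NIL)
    return &sketch_nil;        /* (car nil) is nil */
  if (o->type != SKETCH_CONS)
    return NULL;               /* wrong type */
  return o->car;
}
```

The unchecked accessor is appropriate only after the type has been established by other C code; anything that arrives from Lisp should go through the checked one.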
6053
6054 @node Allocation of Objects in XEmacs Lisp, The Lisp Reader and Compiler, How Lisp Objects Are Represented in C, Top
5784 @chapter Allocation of Objects in XEmacs Lisp 6055 @chapter Allocation of Objects in XEmacs Lisp
5785 @cindex allocation of objects in XEmacs Lisp 6056 @cindex allocation of objects in XEmacs Lisp
5786 @cindex objects in XEmacs Lisp, allocation of 6057 @cindex objects in XEmacs Lisp, allocation of
5787 @cindex Lisp objects, allocation of in XEmacs 6058 @cindex Lisp objects, allocation of in XEmacs
5788 6059
7058 @cindex function, compiled 7329 @cindex function, compiled
7059 7330
7060 Not yet documented. 7331 Not yet documented.
7061 7332
7062 7333
7063 @node Dumping, Events and the Event Loop, Allocation of Objects in XEmacs Lisp, Top 7334 @node The Lisp Reader and Compiler, Evaluation; Stack Frames; Bindings, Allocation of Objects in XEmacs Lisp, Top
7064 @chapter Dumping 7335 @chapter The Lisp Reader and Compiler
7065 @cindex dumping 7336 @cindex Lisp reader and compiler, the
7337 @cindex reader and compiler, the Lisp
7338 @cindex compiler, the Lisp reader and
7339
7340 Not yet documented.
7341
7342 @node Evaluation; Stack Frames; Bindings, Symbols and Variables, The Lisp Reader and Compiler, Top
7343 @chapter Evaluation; Stack Frames; Bindings
7344 @cindex evaluation; stack frames; bindings
7345 @cindex stack frames; bindings, evaluation;
7346 @cindex bindings, evaluation; stack frames;
7066 7347
7067 @menu 7348 @menu
7068 * Dumping Justification:: 7349 * Evaluation::
7069 * Overview:: 7350 * Dynamic Binding; The specbinding Stack; Unwind-Protects::
7070 * Data descriptions:: 7351 * Simple Special Forms::
7071 * Dumping phase:: 7352 * Catch and Throw::
7072 * Reloading phase:: 7353 * Error Trapping::
7073 * Remaining issues::
7074 @end menu 7354 @end menu
7075 7355
7076 @node Dumping Justification, Overview, Dumping, Dumping 7356 @node Evaluation, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings, Evaluation; Stack Frames; Bindings
7077 @section Dumping Justification 7357 @section Evaluation
7078 @cindex dumping, justification 7358 @cindex evaluation
7079 7359
7080 The C code of XEmacs is just a Lisp engine with a lot of built-in 7360 @code{Feval()} evaluates the form (a Lisp object) that is passed to
7081 primitives useful for writing an editor. The editor itself is written 7361 it. Note that evaluation is only non-trivial for two types of objects:
7082 mostly in Lisp, and represents around 100K lines of code. Loading and 7362 symbols and conses. A symbol is evaluated simply by calling
7083 executing the initialization of all this code takes a bit of time (five 7363 @code{symbol-value} on it and returning the value.
7084 to ten times the usual startup time of current xemacs) and requires 7364
7085 having all the lisp source files around. Having to reload them each 7365 Evaluating a cons means calling a function. First, @code{eval} checks
7086 time the editor is started would not be acceptable. 7366 to see if garbage-collection is necessary, and calls
7087 7367 @code{garbage_collect_1()} if so. It then increases the evaluation
7088 The traditional solution to this problem is called dumping: the build 7368 depth by 1 (@code{lisp_eval_depth}, which is always less than
7089 process first creates the lisp engine under the name @file{temacs}, then 7369 @code{max_lisp_eval_depth}) and adds an element to the linked list of
7090 runs it until it has finished loading and initializing all the lisp 7370 @code{struct backtrace}'s (@code{backtrace_list}). Each such structure
7091 code, and eventually creates a new executable called @file{xemacs} 7371 contains a pointer to the function being called plus a list of the
7092 including both the object code in @file{temacs} and all the contents of 7372 function's arguments. Originally these values are stored unevalled, and
7093 the memory after the initialization. 7373 as they are evaluated, the backtrace structure is updated. Garbage
7094 7374 collection pays attention to the objects pointed to in the backtrace
7095 This solution, while working, has a huge problem: the creation of the 7375 structures (garbage collection might happen while a function is being
7096 new executable from the actual contents of memory is an extremely 7376 called or while an argument is being evaluated, and there could easily
7097 system-specific process, quite error-prone, and which interferes with a 7377 be no other references to the arguments in the argument list; once an
7098 lot of system libraries (like malloc). It is even getting worse 7378 argument is evaluated, however, the unevalled version is not needed by
7099 nowadays with libraries using constructors which are automatically 7379 eval, and so the backtrace structure is changed).
7100 called when the program is started (even before @code{main()}) which tend to 7380
7101 crash when they are called multiple times, once before dumping and once 7381 At this point, the function to be called is determined by looking at
7102 after (IRIX 6.x @file{libz.so} pulls in some C++ image libraries thru 7382 the car of the cons (if this is a symbol, its function definition is
7103 dependencies which have this problem). Writing the dumper is also one 7383 retrieved and the process repeated). The function should then consist
7104 of the most difficult parts of porting XEmacs to a new operating system. 7384 of either a @code{Lisp_Subr} (built-in function written in C), a
7105 Basically, `dumping' is an operation that is just not officially 7385 @code{Lisp_Compiled_Function} object, or a cons whose car is one of the
7106 supported on many operating systems. 7386 symbols @code{autoload}, @code{macro} or @code{lambda}.
7107 7387
7108 The aim of the portable dumper is to solve the same problem as the 7388 If the function is a @code{Lisp_Subr}, the lisp object points to a
7109 system-specific dumper, that is to be able to reload quickly, using only 7389 @code{struct Lisp_Subr} (created by @code{DEFUN()}), which contains a
7110 a small number of files, the fully initialized lisp part of the editor, 7390 pointer to the C function, a minimum and maximum number of arguments
7111 without any system-specific hacks. 7391 (or possibly the special constants @code{MANY} or @code{UNEVALLED}), a
7112 7392 pointer to the symbol referring to that subr, and a couple of other
7113 @node Overview, Data descriptions, Dumping Justification, Dumping 7393 things. If the subr wants its arguments @code{UNEVALLED}, they are
7114 @section Overview 7394 passed raw as a list. Otherwise, an array of evaluated arguments is
7115 @cindex dumping overview 7395 created and put into the backtrace structure, and either passed whole
7116 7396 (@code{MANY}) or each argument is passed as a C argument.
7117 The portable dumping system has to: 7397
7118 7398 If the function is a @code{Lisp_Compiled_Function},
7399 @code{funcall_compiled_function()} is called. If the function is a
7400 lambda list, @code{funcall_lambda()} is called. If the function is a
7401 macro, [..... fill in] is done. If the function is an autoload,
7402 @code{do_autoload()} is called to load the definition and then eval
7403 starts over [explain this more].
7404
7405 When @code{Feval()} exits, the evaluation depth is reduced by one, the
7406 debugger is called if appropriate, and the current backtrace structure
7407 is removed from the list.
7408
7409 Both @code{funcall_compiled_function()} and @code{funcall_lambda()} need
7410 to go through the list of formal parameters to the function and bind
7411 them to the actual arguments, checking for @code{&rest} and
7412 @code{&optional} symbols in the formal parameters and making sure the
7413 number of actual arguments is correct.
7414 @code{funcall_compiled_function()} can do this a little more
7415 efficiently, since the formal parameter list can be checked for sanity
7416 when the compiled function object is created.
7417
7418 @code{funcall_lambda()} simply calls @code{Fprogn} to execute the code
7419 in the lambda list.
7420
7421 @code{funcall_compiled_function()} calls the real byte-code interpreter
7422 @code{execute_optimized_program()} on the byte-code instructions, which
7423 are converted into an internal form for faster execution.
7424
7425 When a compiled function is executed for the first time by
7426 @code{funcall_compiled_function()}, or during the dump phase of building
7427 XEmacs, the byte-code instructions are converted from a
7428 @code{Lisp_String} (which is inefficient to access, especially in the
7429 presence of MULE) into a @code{Lisp_Opaque} object containing an array
7430 of unsigned char, which can be directly executed by the byte-code
7431 interpreter. At this time the byte code is also analyzed for validity
7432 and transformed into a more optimized form, so that
7433 @code{execute_optimized_program()} can really fly.
7434
7435 Here are some of the optimizations performed by the internal byte-code
7436 transformer:
7119 @enumerate 7437 @enumerate
7120 @item 7438 @item
7121 At dump time, write all initialized, non-quickly-rebuildable data to a 7439 References to the @code{constants} array are checked for out-of-range
7122 file [Note: currently named @file{xemacs.dmp}, but the name will 7440 indices, so that the byte interpreter doesn't have to.
7123 change], along with all information needed for the reloading. 7441 @item
7124 7442 References to the @code{constants} array that will be used as a Lisp
7125 @item 7443 variable are checked for being correct non-constant (i.e. not @code{t},
7126 When starting xemacs, reload the dump file, relocate it to its new 7444 @code{nil}, or @code{keywordp}) symbols, so that the byte interpreter
7127 starting address if needed, and reinitialize all pointers to this 7445 doesn't have to.
7128 data. Also, rebuild all the quickly rebuildable data. 7446 @item
7447 The maximum number of variable bindings in the byte-code is
7448 pre-computed, so that space on the @code{specpdl} stack can be
7449 pre-reserved once for the whole function execution.
7450 @item
7451 All byte-code jumps are relative to the current program counter instead
7452 of the start of the program, thereby saving a register.
7453 @item
7454 One-byte relative jumps are converted from the byte-code form of unsigned
7455 chars offset by 127 to machine-friendly signed chars.
7129 @end enumerate 7456 @end enumerate
7130 7457
7131 Note: As of 21.5.18, the dump file has been moved inside of the 7458 Of course, this transformation of the @code{instructions} should not be
7132 executable, although there are still problems with this on some systems. 7459 visible to the user, so @code{Fcompiled_function_instructions()} needs
7133 7460 to know how to convert the optimized opaque object back into a Lisp
7134 @node Data descriptions, Dumping phase, Overview, Dumping 7461 string that is identical to the original string from the @file{.elc}
7135 @section Data descriptions 7462 file. (Actually, the resulting string may (rarely) contain slightly
7136 @cindex dumping data descriptions 7463 different, yet equivalent, byte code.)
7137 7464
7138 The more complex task of the dumper is to be able to write memory blocks 7465 @code{Ffuncall()} implements Lisp @code{funcall}. @code{(funcall fun
7139 on the heap (lisp objects, i.e. lrecords, and C-allocated memory, such 7466 x1 x2 x3 ...)} is equivalent to @code{(eval (list fun (quote x1) (quote
7140 as structs and arrays) to disk and reload them at a different address, 7467 x2) (quote x3) ...))}. @code{Ffuncall()} contains its own code to do
7141 updating all the pointers they include in the process. This is done by 7468 the evaluation, however, and is very similar to @code{Feval()}.
7142 using external data descriptions that give information about the layout 7469
7143 of the blocks in memory. 7470 From the performance point of view, it is worth knowing that most of the
7144 7471 time in Lisp evaluation is spent executing @code{Lisp_Subr} and
7145 The specification of these descriptions is in lrecord.h. A description 7472 @code{Lisp_Compiled_Function} objects via @code{Ffuncall()} (not
7146 of an lrecord is an array of struct memory_description. Each of these 7473 @code{Feval()}).
7147 structs include a type, an offset in the block and some optional 7474
7148 parameters depending on the type. For instance, here is the string 7475 @code{Fapply()} implements Lisp @code{apply}, which is very similar to
7149 description: 7476 @code{funcall} except that if the last argument is a list, the result is the
7477 same as if each of the arguments in the list had been passed separately.
7478 @code{Fapply()} does some business to expand the last argument if it's a
7479 list, then calls @code{Ffuncall()} to do the work.
7480
7481 @code{apply1()}, @code{call0()}, @code{call1()}, @code{call2()}, and
7482 @code{call3()} call a function, passing it the argument(s) given (the
7483 arguments are given as separate C arguments rather than being passed as
7484 an array). @code{apply1()} uses @code{Fapply()} while the others use
7485 @code{Ffuncall()} to do the real work.
7486
7487 @node Dynamic Binding; The specbinding Stack; Unwind-Protects, Simple Special Forms, Evaluation, Evaluation; Stack Frames; Bindings
7488 @section Dynamic Binding; The specbinding Stack; Unwind-Protects
7489 @cindex dynamic binding; the specbinding stack; unwind-protects
7490 @cindex binding; the specbinding stack; unwind-protects, dynamic
7491 @cindex specbinding stack; unwind-protects, dynamic binding; the
7492 @cindex unwind-protects, dynamic binding; the specbinding stack;
7150 7493
7151 @example 7494 @example
7152 static const struct memory_description string_description[] = @{ 7495 struct specbinding
7153 @{ XD_BYTECOUNT, offsetof (Lisp_String, size) @}, 7496 @{
7154 @{ XD_OPAQUE_DATA_PTR, offsetof (Lisp_String, data), XD_INDIRECT(0, 1) @}, 7497 Lisp_Object symbol;
7155 @{ XD_LISP_OBJECT, offsetof (Lisp_String, plist) @}, 7498 Lisp_Object old_value;
7156 @{ XD_END @} 7499 Lisp_Object (*func) (Lisp_Object); /* for unwind-protect */
7157 @}; 7500 @};
7158 @end example 7501 @end example
7159 7502
7160 The first line indicates a member of type Bytecount, which is used by 7503 @code{struct specbinding} is used for local-variable bindings and
7161 the next, indirect directive. The second means "there is a pointer to 7504 unwind-protects. @code{specpdl} holds an array of @code{struct specbinding}'s,
7162 some opaque data in the field @code{data}". The length of said data is 7505 @code{specpdl_ptr} points to the beginning of the free bindings in the
7163 given by the expression @code{XD_INDIRECT(0, 1)}, which means "the value 7506 array, @code{specpdl_size} specifies the total number of binding slots
7164 in the 0th line of the description (welcome to C) plus one". The third 7507 in the array, and @code{max_specpdl_size} specifies the maximum number
7165 line means "there is a Lisp_Object member @code{plist} in the Lisp_String 7508 of bindings the array can be expanded to hold. @code{grow_specpdl()}
7166 structure". @code{XD_END} then ends the description. 7509 increases the size of the @code{specpdl} array, multiplying its size by
7167 7510 2 but never exceeding @code{max_specpdl_size} (except that if this
7168 This gives us all the information we need to move around what is pointed 7511 number is less than 400, it is first set to 400).
7169 to by a memory block (C or lrecord) and, by transitivity, everything 7512
7170 that it points to. The only missing information for dumping is the size 7513 @code{specbind()} binds a symbol to a value and is used for local
7171 of the block. For lrecords, this is part of the 7514 variables and @code{let} forms. The symbol and its old value (which
7172 lrecord_implementation, so we don't need to duplicate it. For C blocks 7515 might be @code{Qunbound}, indicating no prior value) are recorded in the
7173 we use a struct sized_memory_description, which includes a size field 7516 specpdl array, and @code{specpdl_ptr} is advanced by one slot.
7174 and a pointer to an associated array of memory_description. 7517
7175 7518 @code{record_unwind_protect()} implements an @dfn{unwind-protect},
7176 @node Dumping phase, Reloading phase, Data descriptions, Dumping 7519 which, when placed around a section of code, ensures that some specified
7177 @section Dumping phase 7520 cleanup routine will be executed even if the code exits abnormally
7178 @cindex dumping phase 7521 (e.g. through a @code{throw} or quit). @code{record_unwind_protect()}
7179 7522 simply adds a new specbinding to the @code{specpdl} array and stores the
7180 Dumping is done by calling the function @code{pdump()} (in @file{dumper.c}) which is 7523 appropriate information in it. The cleanup routine can either be a C
7181 invoked from Fdump_emacs (in @file{emacs.c}). This function performs a number 7524 function, which is stored in the @code{func} field, or a @code{progn}
7182 of tasks. 7525 form, which is stored in the @code{old_value} field.
7526
7527 @code{unbind_to()} removes specbindings from the @code{specpdl} array
7528 until the specified position is reached. Each specbinding can be one of
7529 three types:
7530
7531 @enumerate
7532 @item
7533 an unwind-protect with a C cleanup function (@code{func} is not 0, and
7534 @code{old_value} holds an argument to be passed to the function);
7535 @item
7536 an unwind-protect with a Lisp form (@code{func} is 0, @code{symbol}
7537 is @code{nil}, and @code{old_value} holds the form to be executed with
7538 @code{Fprogn()}); or
7539 @item
7540 a local-variable binding (@code{func} is 0, @code{symbol} is not
7541 @code{nil}, and @code{old_value} holds the old value, which is stored as
7542 the symbol's value).
7543 @end enumerate
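
The stack discipline described above can be sketched in a few lines of
plain C.  This is only an illustration under simplifying assumptions, not
the code in @file{eval.c}: a ``Lisp object'' is an @code{int}, a symbol
is a pointer to the @code{int} holding its value, the array never grows,
and the Lisp-form variant of unwind-protect is left out.  All
@code{toy_} names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins: a "Lisp object" is an int, and a symbol is a pointer
   to the int that holds its current value.  */
typedef int toy_object;

struct toy_specbinding
{
  toy_object *symbol;            /* NULL for an unwind-protect */
  toy_object old_value;          /* saved value, or cleanup argument */
  void (*func) (toy_object);     /* C cleanup routine, or NULL */
};

#define TOY_SPECPDL_SIZE 16
static struct toy_specbinding toy_specpdl[TOY_SPECPDL_SIZE];
static int toy_specpdl_depth;    /* index of the first free slot */

/* Bind *SYMBOL to VALUE, remembering the old value on the stack.  */
static void
toy_specbind (toy_object *symbol, toy_object value)
{
  struct toy_specbinding *sb;
  assert (toy_specpdl_depth < TOY_SPECPDL_SIZE);
  sb = &toy_specpdl[toy_specpdl_depth++];
  sb->symbol = symbol;
  sb->old_value = *symbol;
  sb->func = NULL;
  *symbol = value;
}

/* Arrange for FUNC (ARG) to run when the stack is unwound past here.  */
static void
toy_record_unwind_protect (void (*func) (toy_object), toy_object arg)
{
  struct toy_specbinding *sb;
  assert (toy_specpdl_depth < TOY_SPECPDL_SIZE);
  sb = &toy_specpdl[toy_specpdl_depth++];
  sb->symbol = NULL;
  sb->old_value = arg;
  sb->func = func;
}

/* Pop entries until the stack is back at depth COUNT.  */
static void
toy_unbind_to (int count)
{
  while (toy_specpdl_depth > count)
    {
      struct toy_specbinding *sb = &toy_specpdl[--toy_specpdl_depth];
      if (sb->func)
        sb->func (sb->old_value);     /* unwind-protect, C function */
      else if (sb->symbol)
        *sb->symbol = sb->old_value;  /* local-variable binding */
    }
}

static int toy_cleanup_total;
static void toy_cleanup (toy_object arg) { toy_cleanup_total += arg; }

/* Usage: bind a variable, register a cleanup, then unwind both.  */
int
toy_specpdl_demo (void)
{
  toy_object var = 1;
  int count = toy_specpdl_depth;
  toy_specbind (&var, 2);
  toy_record_unwind_protect (toy_cleanup, 40);
  toy_unbind_to (count);
  return var + toy_cleanup_total;     /* restored 1, plus 40 from cleanup */
}
```

Callers save the stack depth before binding and hand it back to the
unwinding routine afterwards, so bindings and cleanups are always undone
in the reverse of the order in which they were made.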
7544
7545 @node Simple Special Forms, Catch and Throw, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings
7546 @section Simple Special Forms
7547 @cindex special forms, simple
7548
7549 @code{or}, @code{and}, @code{if}, @code{cond}, @code{progn},
7550 @code{prog1}, @code{prog2}, @code{setq}, @code{quote}, @code{function},
7551 @code{let*}, @code{let}, @code{while}
7552
7553 All of these are very simple and work as expected, calling
7554 @code{Feval()} or @code{Fprogn()} as necessary and (in the case of
7555 @code{let} and @code{let*}) using @code{specbind()} to create bindings
7556 and @code{unbind_to()} to undo the bindings when finished.
7557
7558 Note that, with the exception of @code{Fprogn}, these functions are
7559 typically called in real life only in interpreted code, since the byte
7560 compiler knows how to convert calls to these functions directly into
7561 byte code.
7562
7563 @node Catch and Throw, Error Trapping, Simple Special Forms, Evaluation; Stack Frames; Bindings
7564 @section Catch and Throw
7565 @cindex catch and throw
7566 @cindex throw, catch and
7567
7568 @example
7569 struct catchtag
7570 @{
7571 Lisp_Object tag;
7572 Lisp_Object val;
7573 struct catchtag *next;
7574 struct gcpro *gcpro;
7575 jmp_buf jmp;
7576 struct backtrace *backlist;
7577 int lisp_eval_depth;
7578 int pdlcount;
7579 @};
7580 @end example
7581
7582 @code{catch} is a Lisp special form that places a catch around a body of
7583 code. A catch is a means of non-local exit from the code. When a catch
7584 is created, a tag is specified, and executing a @code{throw} to this tag
7585 will exit from the body of code caught with this tag, and its value will
7586 be the value given in the call to @code{throw}. If there is no such
7587 call, the code will be executed normally.
7588
7589 Information pertaining to a catch is held in a @code{struct catchtag},
7590 which is placed at the head of a linked list pointed to by
7591 @code{catchlist}. @code{internal_catch()} is passed a C function to
7592 call (@code{Fprogn()} when Lisp @code{catch} is called) and arguments to
7593 give it, and places a catch around the function. Each @code{struct
7594 catchtag} is held in the stack frame of the @code{internal_catch()}
7595 instance that created the catch.
7596
7597 @code{internal_catch()} is fairly straightforward. It stores into the
7598 @code{struct catchtag} the tag name and the current values of
7599 @code{backtrace_list}, @code{lisp_eval_depth}, @code{gcprolist}, and the
7600 offset into the @code{specpdl} array, sets a jump point with @code{_setjmp()}
7601 (storing the jump point into the @code{struct catchtag}), and calls the
7602 function. Control will return to @code{internal_catch()} either when
7603 the function exits normally or through a @code{_longjmp()} to this jump
7604 point. In the latter case, @code{throw} will store the value to be
7605 returned into the @code{struct catchtag} before jumping. When it's
7606 done, @code{internal_catch()} removes the @code{struct catchtag} from
7607 the catchlist and returns the proper value.
7608
7609 @code{Fthrow()} goes up through the catchlist until it finds one with
7610 a matching tag. It then calls @code{unbind_catch()} to restore
7611 everything to what it was when the appropriate catch was set, stores the
7612 return value in the @code{struct catchtag}, and jumps (with
7613 @code{_longjmp()}) to its jump point.
7614
7615 @code{unbind_catch()} removes all catches from the catchlist until it
7616 finds the correct one. Some of the catches might have been placed for
7617 error-trapping, and if so, the appropriate entries on the handlerlist
7618 must be removed (see ``errors''). @code{unbind_catch()} also restores
7619 the values of @code{gcprolist}, @code{backtrace_list}, and
7620 @code{lisp_eval_depth}, and calls @code{unbind_to()} to undo any specbindings
7621 created since the catch.
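
The interplay of the catch list and the jump points can be sketched with
standard @code{setjmp()} and @code{longjmp()}.  This is a reduced
illustration, not the real implementation: tags and values are plain
@code{int}s, and the saved @code{gcprolist}, @code{backtrace_list},
@code{lisp_eval_depth} and specpdl depth are omitted.  All @code{toy_}
names are hypothetical.

```c
#include <setjmp.h>
#include <stddef.h>

struct toy_catchtag
{
  int tag;                     /* stand-in for the Lisp tag object */
  volatile int val;            /* value handed over by the throw;
                                  volatile because it changes between
                                  setjmp() and longjmp() */
  struct toy_catchtag *next;   /* next-outer catch */
  jmp_buf jmp;                 /* where to resume */
};

static struct toy_catchtag *toy_catchlist;

/* Call FUNC (ARG) with a catch for TAG established around it.  */
int
toy_internal_catch (int tag, int (*func) (int), int arg)
{
  struct toy_catchtag c;       /* lives in this stack frame */
  int result;

  c.tag = tag;
  c.val = 0;
  c.next = toy_catchlist;
  toy_catchlist = &c;

  if (setjmp (c.jmp) == 0)
    result = func (arg);       /* normal return */
  else
    result = c.val;            /* arrived here via toy_throw() */

  toy_catchlist = c.next;      /* remove our catch */
  return result;
}

/* Transfer control to the innermost catch for TAG, delivering VAL.
   Returns only if no such catch exists.  */
void
toy_throw (int tag, int val)
{
  struct toy_catchtag *c;
  for (c = toy_catchlist; c; c = c->next)
    if (c->tag == tag)
      {
        toy_catchlist = c;     /* discard the inner catches */
        c->val = val;
        longjmp (c->jmp, 1);
      }
}

static int
toy_body (int arg)
{
  if (arg > 10)
    toy_throw (7, -1);         /* non-local exit */
  return arg * 2;
}

/* Usage: small arguments return normally, large ones are thrown.  */
int
toy_catch_demo (int arg)
{
  return toy_internal_catch (7, toy_body, arg);
}
```

Because each catch record lives in the stack frame of the function that
established it, jumping back to that frame automatically keeps the record
valid, while the records of the abandoned inner frames are simply
unlinked.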
7622
7623 @node Error Trapping, , Catch and Throw, Evaluation; Stack Frames; Bindings
7624 @section Error Trapping
7625 @cindex error trapping
7626
7627 @subheading call_trapping_problems():
7628
7629 This is equivalent to (*fun) (arg), except that various conditions
7630 can be trapped or inhibited, according to FLAGS.
7631
7632 @itemize @bullet
7633 @item
7634 If FLAGS does not contain NO_INHIBIT_ERRORS, when an error occurs,
7635 the error is caught and a warning is issued, specifying the
7636 specific error that occurred and a backtrace. In that case,
7637 WARNING_STRING should be given, and will be printed at the
7638 beginning of the error to indicate where the error occurred.
7639
7640 @item
7641 If FLAGS does not contain NO_INHIBIT_THROWS, all attempts to
7642 @code{throw} out of the function being called are trapped, and a warning
7643 issued. (Again, WARNING_STRING should be given.)
7644
7645 @item
7646 If FLAGS contains INHIBIT_WARNING_ISSUE, no warnings are issued;
7647 this applies to recursive invocations of call_trapping_problems, too.
7648
7649 @item
7650 If FLAGS contains POSTPONE_WARNING_ISSUE, no warnings are issued;
7651 but values useful for generating a warning are still computed (in
7652 particular, the backtrace), so that the calling function can issue
7653 a warning.
7654
7655 @item
7656 If FLAGS contains ISSUE_WARNINGS_AT_DEBUG_LEVEL, warnings will be
7657 issued, but at level @code{debug}, which normally is below the minimum
7658 specified by @code{log-warning-minimum-level}, meaning such warnings will
7659 be ignored entirely. (The user can change this variable, however,
7660 to see the warnings.)
7661
7662 Note: If neither of NO_INHIBIT_THROWS or NO_INHIBIT_ERRORS is
7663 given, you are @strong{guaranteed} that there will be no non-local exits
7664 out of this function.
7665
7666 @item
7667 If FLAGS contains INHIBIT_QUIT, QUIT using C-g is inhibited. (This
7668 is @strong{rarely} a good idea.) Unless you use NO_INHIBIT_ERRORS, QUIT is
7669 automatically caught as well, and treated as an error; you can
7670 check for this using EQ (problems->error_conditions, Qquit).
7671
7672 @item
7673 If FLAGS contains UNINHIBIT_QUIT, QUIT checking will be explicitly
7674 turned on. (It will abort the code being called, but will still be
7675 trapped and reported as an error, unless NO_INHIBIT_ERRORS is
7676 given.) This is useful when QUIT checking has been turned off by a
7677 higher-level caller.
7678
7679 @item
7680 If FLAGS contains INHIBIT_GC, garbage collection is inhibited.
7681 This is useful for Lisp called within redisplay, for example.
7682
7683 @item
7684 If FLAGS contains INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION,
7685 Lisp code is not allowed to delete any window, buffers, frames, devices,
7686 or consoles that were already in existence at the time this function
7687 was called. (However, it's perfectly legal for code to create a new
7688 buffer and then delete it.)
7689
7690 #### It might be useful to have a flag that inhibits deletion of a
7691 specific permanent display object and everything it's attached to
7692 (e.g. a window, and the buffer, frame, device, and console it's
7693 attached to).
7694
7695 @item
7696 If FLAGS contains INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION, Lisp
7697 code is not allowed to modify the text of any buffers that were
7698 already in existence at the time this function was called.
7699 (However, it's perfectly legal for code to create a new buffer and
7700 then modify its text.)
7701
7702 @quotation
7703 [These last two flags are implemented using global variables
7704 Vdeletable_permanent_display_objects and Vmodifiable_buffers,
7705 which keep track of a list of all buffers or permanent display
7706 objects created since the last time one of these flags was set.
7707 The code that deletes buffers, etc. and modifies buffers checks
7708
7709 @enumerate
7710 @item
7711 if the corresponding flag is set (through the global variable
7712 inhibit_flags or its accessor function get_inhibit_flags()), and
7713
7714 @item
7715 if the object to be modified or deleted is not in the
7716 appropriate list.
7717 @end enumerate
7718
7719 If so, it signals an error.
7720
7721 Recursive calls to call_trapping_problems() are allowed. In
7722 the case of the two flags mentioned above, the current values
7723 of the global variables are stored in an unwind-protect, and
7724 they're reset to nil.]
7725 @end quotation
7726
7727 @item
7728 If FLAGS contains INHIBIT_ENTERING_DEBUGGER, the debugger will not
7729 be entered if an error occurs inside the Lisp code being called,
7730 even when the user has requested that the debugger be entered on error. In such case, a warning
7731 is issued stating that access to the debugger is denied, unless
7732 INHIBIT_WARNING_ISSUE has also been supplied. This is useful when
7733 calling Lisp code inside redisplay, in menu callbacks, etc. because
7734 in such cases either the display is in an inconsistent state or
7735 doing window operations is explicitly forbidden by the OS, and the
7736 debugger would cause visual changes on the screen and might create
7737 another frame.
7738
7739 @item
7740 If FLAGS contains INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY, no
7741 changes of any sort to extents, faces, glyphs, buffer text,
7742 specifiers relating to display, other variables relating to
7743 display, splitting, deleting, or resizing windows or frames,
7744 deleting buffers, windows, frames, devices, or consoles, etc. are
7745 allowed. This is for things called absolutely in the middle of
7746 redisplay, which expects things to be @strong{exactly} the same after the
7747 call as before. This isn't completely implemented and needs to be
7748 thought out some more to determine exactly what its semantics are.
7749 For the moment, turning on this flag also turns on
7750
7751 @itemize @minus
7752 @item
7753 INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION
7754 @item
7755 INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION
7756 @item
7757 INHIBIT_ENTERING_DEBUGGER
7758 @item
7759 INHIBIT_WARNING_ISSUE
7760 @item
7761 INHIBIT_GC
7762 @end itemize
7763
7764 @item
7765 #### The following five flags are defined, but unimplemented:
7766
7767 #define INHIBIT_EXISTING_CODING_SYSTEM_DELETION (1<<6)
7768 #define INHIBIT_EXISTING_CHARSET_DELETION (1<<7)
7769 #define INHIBIT_PERMANENT_DISPLAY_OBJECT_CREATION (1<<8)
7770 #define INHIBIT_CODING_SYSTEM_CREATION (1<<9)
7771 #define INHIBIT_CHARSET_CREATION (1<<10)
7772
7773 @item
7774 FLAGS containing CALL_WITH_SUSPENDED_ERRORS is a sign that
7775 call_with_suspended_errors() was invoked. This exists only for
7776 debugging purposes -- often we want to break when a signal happens,
7777 but ignore signals from call_with_suspended_errors(), because they
7778 occur often and for legitimate reasons.
7779 @end itemize
7780
7781 If PROBLEM is non-zero, it should be a pointer to a structure into
7782 which exact information about any occurring problems (either an
7783 error or an attempted throw past this boundary) is stored.
7784
7785 If a problem occurred and aborted operation (error, quit, or
7786 invalid throw), Qunbound is returned. Otherwise the return value
7787 from the call to (*fun) (arg) is returned.
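
Two of the flag behaviors described above lend themselves to a small
sketch: the way INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY implies five other
flags, and the ``created since the flag was set'' test that guards
deletion.  The bit values below are made up for the illustration (only
the flag names come from the text), objects are identified by plain
@code{int}s, and all @code{toy_} and @code{TOY_} names are hypothetical.

```c
/* Hypothetical bit assignments; only the flag names come from the text.  */
enum
{
  TOY_INHIBIT_WARNING_ISSUE                              = 1 << 0,
  TOY_INHIBIT_GC                                         = 1 << 1,
  TOY_INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION = 1 << 2,
  TOY_INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION          = 1 << 3,
  TOY_INHIBIT_ENTERING_DEBUGGER                          = 1 << 4,
  TOY_INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY             = 1 << 5
};

/* Add the flags that the redisplay flag currently turns on.  */
int
toy_expand_flags (int flags)
{
  if (flags & TOY_INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY)
    flags |= (TOY_INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION
              | TOY_INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION
              | TOY_INHIBIT_ENTERING_DEBUGGER
              | TOY_INHIBIT_WARNING_ISSUE
              | TOY_INHIBIT_GC);
  return flags;
}

/* Return nonzero if deleting object ID must be refused: the deletion
   flag is set and ID is not among the N_CREATED objects in CREATED,
   i.e. it already existed when the flag was set.  */
int
toy_deletion_forbidden (int flags, const int *created, int n_created,
                        int id)
{
  int i;
  if (!(flags & TOY_INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION))
    return 0;
  for (i = 0; i < n_created; i++)
    if (created[i] == id)
      return 0;                /* created after the flag was set */
  return 1;
}
```

Tracking the objects created since the flag was set, rather than the
ones that existed before, keeps the list short and lets it start out
empty on each recursive call.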
7788
7789 @node Symbols and Variables, Buffers, Evaluation; Stack Frames; Bindings, Top
7790 @chapter Symbols and Variables
7791 @cindex symbols and variables
7792 @cindex variables, symbols and
7183 7793
7184 @menu 7794 @menu
7185 * Object inventory:: 7795 * Introduction to Symbols::
7186 * Address allocation:: 7796 * Obarrays::
7187 * The header:: 7797 * Symbol Values::
7188 * Data dumping::
7189 * Pointers dumping::
7190 @end menu 7798 @end menu
7191 7799
7192 @node Object inventory, Address allocation, Dumping phase, Dumping phase 7800 @node Introduction to Symbols, Obarrays, Symbols and Variables, Symbols and Variables
7193 @subsection Object inventory 7801 @section Introduction to Symbols
7194 @cindex dumping object inventory 7802 @cindex symbols, introduction to
7195 @cindex memory blocks 7803
7196 7804 A symbol is basically just an object with four fields: a name (a
7197 The first task is to build the list of the objects to dump. This 7805 string), a value (some Lisp object), a function (some Lisp object), and
7198 includes: 7806 a property list (usually a list of alternating keyword/value pairs).
7807 What makes symbols special is that there is usually only one symbol with
7808 a given name, and the symbol is referred to by name. This makes a
7809 symbol a convenient way of calling up data by name, i.e. of implementing
7810 variables. (The variable's value is stored in the @dfn{value slot}.)
7811 Similarly, functions are referenced by name, and the definition of the
7812 function is stored in a symbol's @dfn{function slot}. This means that
7813 there can be a distinct function and variable with the same name. The
7814 property list is used as a more general mechanism of associating
7815 additional values with particular names, and once again the namespace is
7816 independent of the function and variable namespaces.
7817
7818 @node Obarrays, Symbol Values, Introduction to Symbols, Symbols and Variables
7819 @section Obarrays
7820 @cindex obarrays
7821
7822 The identity of symbols with their names is accomplished through a
7823 structure called an obarray, which is just a poorly-implemented hash
7824 table mapping from strings to symbols whose name is that string. (I say
7825 ``poorly implemented'' because an obarray appears in Lisp as a vector
7826 with some hidden fields rather than as its own opaque type. This is an
7827 Emacs Lisp artifact that should be fixed.)
7828
7829 Obarrays are implemented as a vector of some fixed size (which should
7830 be a prime for best results), where each ``bucket'' of the vector
7831 contains one or more symbols, threaded through a hidden @code{next}
7832 field in the symbol. Lookup of a symbol in an obarray, and adding a
7833 symbol to an obarray, is accomplished through standard hash-table
7834 techniques.
7835
7836 The standard Lisp function for working with symbols and obarrays is
7837 @code{intern}. This looks up a symbol in an obarray given its name; if
7838 it's not found, a new symbol is automatically created with the specified
7839 name, added to the obarray, and returned. This is what happens when the
7840 Lisp reader encounters a symbol (or more precisely, encounters the name
7841 of a symbol) in some text that it is reading. There is a standard
7842 obarray called @code{obarray} that is used for this purpose, although
7843 the Lisp programmer is free to create his own obarrays and @code{intern}
7844 symbols in them.
7845
7846 Note that, once a symbol is in an obarray, it stays there until
7847 something is done about it, and the standard obarray @code{obarray}
7848 always stays around, so once you use any particular variable name, a
7849 corresponding symbol will stay around in @code{obarray} until you exit
7850 XEmacs.
7851
7852 Note that @code{obarray} itself is a variable, and as such there is a
7853 symbol in @code{obarray} whose name is @code{"obarray"} and which
7854 contains @code{obarray} as its value.
7855
7856 Note also that this call to @code{intern} occurs only when in the Lisp
7857 reader, not when the code is executed (at which point the symbol is
7858 already around, stored as such in the definition of the function).
7859
7860 You can create your own obarray using @code{make-vector} (this is
7861 horrible but is an artifact) and intern symbols into that obarray.
7862 Doing that will result in two or more symbols with the same name.
7863 However, at most one of these symbols is in the standard @code{obarray}:
7864 You cannot have two symbols of the same name in any particular obarray.
7865 Note that you cannot add a symbol to an obarray in any fashion other
7866 than using @code{intern}: i.e. you can't take an existing symbol and put
7867 it in an existing obarray. Nor can you change the name of an existing
7868 symbol. (Since obarrays are vectors, you can violate the consistency of
7869 things by storing directly into the vector, but let's ignore that
7870 possibility.)
7871
7872 Usually symbols are created by @code{intern}, but if you really want,
7873 you can explicitly create a symbol using @code{make-symbol}, giving it
7874 some name. The resulting symbol is not in any obarray (i.e. it is
7875 @dfn{uninterned}), and you can't add it to any obarray. Therefore its
7876 primary purpose is as a symbol to use in macros to avoid namespace
7877 pollution. It can also be used as a carrier of information, but cons
7878 cells could probably be used just as well.
7879
7880 You can also use @code{intern-soft} to look up a symbol but not create
7881 a new one, and @code{unintern} to remove a symbol from an obarray. This
7882 returns the removed symbol. (Remember: You can't put the symbol back
7883 into any obarray.) Finally, @code{mapatoms} maps over all of the symbols
7884 in an obarray.
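
As a rough picture of the bucket structure, here is a minimal sketch of
an obarray in plain C with analogues of @code{intern-soft} and
@code{intern}.  It is an illustration under assumptions, not the code in
@file{symbols.c}: the hash function, the table size of 13, and every
@code{toy_} name are made up, and symbols carry nothing but a name and
the hidden @code{next} link.

```c
#include <stdlib.h>
#include <string.h>

struct toy_symbol
{
  char *name;
  struct toy_symbol *next;     /* hidden chain within one bucket */
};

#define TOY_OBARRAY_SIZE 13    /* a small prime */

struct toy_obarray
{
  struct toy_symbol *bucket[TOY_OBARRAY_SIZE];
};

static unsigned
toy_hash (const char *s)
{
  unsigned h = 0;
  while (*s)
    h = h * 31 + (unsigned char) *s++;
  return h % TOY_OBARRAY_SIZE;
}

/* Like intern-soft: return the symbol named NAME, or NULL.  */
struct toy_symbol *
toy_intern_soft (struct toy_obarray *ob, const char *name)
{
  struct toy_symbol *sym;
  for (sym = ob->bucket[toy_hash (name)]; sym; sym = sym->next)
    if (strcmp (sym->name, name) == 0)
      return sym;
  return NULL;
}

/* Like intern: look NAME up, creating the symbol if it is missing.
   Returns NULL only if memory runs out.  */
struct toy_symbol *
toy_intern (struct toy_obarray *ob, const char *name)
{
  struct toy_symbol *sym = toy_intern_soft (ob, name);
  if (!sym)
    {
      unsigned h = toy_hash (name);
      sym = malloc (sizeof *sym);
      if (!sym)
        return NULL;
      sym->name = malloc (strlen (name) + 1);
      if (!sym->name)
        {
          free (sym);
          return NULL;
        }
      strcpy (sym->name, name);
      sym->next = ob->bucket[h];   /* push onto the bucket's chain */
      ob->bucket[h] = sym;
    }
  return sym;
}

/* Usage: interning the same name twice yields the same symbol.  */
int
toy_obarray_demo (void)
{
  static struct toy_obarray ob;    /* zero-initialized: all buckets empty */
  struct toy_symbol *a = toy_intern (&ob, "foo");
  struct toy_symbol *b = toy_intern (&ob, "foo");
  return a != NULL && a == b && toy_intern_soft (&ob, "bar") == NULL;
}
```

The sketch never frees a symbol, which matches the observation above
that an interned symbol stays around until something explicitly removes
it.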
7885
7886 @node Symbol Values, , Obarrays, Symbols and Variables
7887 @section Symbol Values
7888 @cindex symbol values
7889 @cindex values, symbol
7890
7891 The value field of a symbol normally contains a Lisp object. However,
7892 a symbol can be @dfn{unbound}, meaning that it logically has no value.
7893 This is internally indicated by storing a special Lisp object, called
7894 @dfn{the unbound marker} and stored in the global variable
7895 @code{Qunbound}. The unbound marker is of a special Lisp object type
7896 called @dfn{symbol-value-magic}. It is impossible for the Lisp
7897 programmer to directly create or access any object of this type.
7898
7899 @strong{You must not let any ``symbol-value-magic'' object escape to
7900 the Lisp level.} Printing any of these objects will cause the message
7901 @samp{INTERNAL EMACS BUG} to appear as part of the print representation.
7902 (You may see this normally when you call @code{debug_print()} from the
7903 debugger on a Lisp object.) If you let one of these objects escape to
7904 the Lisp level, you will violate a number of assumptions contained in
7905 the C code and make the unbound marker not function right.
7906
7907 When a symbol is created, its value field (and function field) are set
7908 to @code{Qunbound}. The Lisp programmer can restore these conditions
7909 later using @code{makunbound} or @code{fmakunbound}, and can query to
7910 see whether the value or function fields are @dfn{bound} (i.e. have a
7911 value other than @code{Qunbound}) using @code{boundp} and
7912 @code{fboundp}. The fields are set to a normal Lisp object using
7913 @code{set} (or @code{setq}) and @code{fset}.
7914
7915 Other symbol-value-magic objects are used as special markers to
7916 indicate variables that have non-normal properties. This includes any
7917 variables that are tied into C variables (setting the variable magically
7918 sets some global variable in the C code, and likewise for retrieving the
7919 variable's value), variables that magically tie into slots in the
7920 current buffer, variables that are buffer-local, etc. The
7921 symbol-value-magic object is stored in the value cell in place of
7922 a normal object, and the code to retrieve a symbol's value
7923 (i.e. @code{symbol-value}) knows how to do special things with them.
7924 This means that you should not just fetch the value cell directly if you
7925 want a symbol's value.
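
The reason for going through an accessor can be seen in a small sketch.
It assumes a drastically reduced model: values are @code{int}s, and a
value cell is either unbound, ordinary, or a single kind of magic that
forwards to a C @code{int} variable.  The real set of magic types and
their layout are different (see the files named below); all @code{toy_}
and @code{TOY_} names are hypothetical.

```c
#include <stddef.h>

enum toy_cell_kind
{
  TOY_UNBOUND,         /* plays the role of the unbound marker */
  TOY_NORMAL,          /* an ordinary value is stored in the cell */
  TOY_FORWARD_INT      /* magic: the value lives in a C int variable */
};

struct toy_symbol
{
  enum toy_cell_kind kind;
  int normal_value;    /* valid when kind == TOY_NORMAL */
  int *forward;        /* valid when kind == TOY_FORWARD_INT */
};

/* Like boundp.  */
int
toy_boundp (const struct toy_symbol *sym)
{
  return sym->kind != TOY_UNBOUND;
}

/* Like symbol-value: interpret the cell rather than copying it.
   Stores the value in *OUT and returns 0, or returns -1 if the symbol
   is unbound (where the real code would signal an error).  */
int
toy_symbol_value (const struct toy_symbol *sym, int *out)
{
  switch (sym->kind)
    {
    case TOY_NORMAL:
      *out = sym->normal_value;
      return 0;
    case TOY_FORWARD_INT:
      *out = *sym->forward;
      return 0;
    default:
      return -1;
    }
}

/* Like set: a forwarding cell stays in place and the C variable is
   written instead.  */
void
toy_set (struct toy_symbol *sym, int value)
{
  if (sym->kind == TOY_FORWARD_INT)
    *sym->forward = value;
  else
    {
      sym->kind = TOY_NORMAL;
      sym->normal_value = value;
    }
}

static int toy_c_variable = 5;

/* Usage: setting the magic symbol changes the C variable.  */
int
toy_symbol_value_demo (void)
{
  struct toy_symbol plain = { TOY_UNBOUND, 0, NULL };
  struct toy_symbol magic = { TOY_FORWARD_INT, 0, &toy_c_variable };
  int v = 0;

  if (toy_boundp (&plain) || toy_symbol_value (&plain, &v) != -1)
    return -1;
  toy_set (&plain, 3);
  toy_set (&magic, 9);          /* really stores into toy_c_variable */
  if (toy_symbol_value (&magic, &v) != 0)
    return -1;
  return v + plain.normal_value + toy_c_variable;   /* 9 + 3 + 9 */
}
```

Code that read the cell directly would see the marker itself, not the
value it stands for, which is exactly how a magic object could escape to
the Lisp level.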
7926
7927 The exact workings of this are rather complex and involved and are
7928 well-documented in comments in @file{buffer.c}, @file{symbols.c}, and
7929 @file{lisp.h}.
7930
7931 @node Buffers, Text, Symbols and Variables, Top
7932 @chapter Buffers
7933 @cindex buffers
7934
7935 @menu
7936 * Introduction to Buffers:: A buffer holds a block of text such as a file.
7937 * Buffer Lists:: Keeping track of all buffers.
7938 * Markers and Extents:: Tagging locations within a buffer.
7939 * The Buffer Object:: The Lisp object corresponding to a buffer.
7940 @end menu
7941
7942 @node Introduction to Buffers, Buffer Lists, Buffers, Buffers
7943 @section Introduction to Buffers
7944 @cindex buffers, introduction to
7945
7946 A buffer is logically just a Lisp object that holds some text.
7947 In this, it is like a string, but a buffer is optimized for
7948 frequent insertion and deletion, while a string is not. Furthermore:
7949
7950 @enumerate
7951 @item
7952 Buffers are @dfn{permanent} objects, i.e. once you create them, they
7953 remain around, and need to be explicitly deleted before they go away.
7954 @item
7955 Each buffer has a unique name, which is a string. Buffers are
7956 normally referred to by name. In this respect, they are like
7957 symbols.
7958 @item
7959 Buffers have a default insertion position, called @dfn{point}.
7960 Inserting text (unless you explicitly give a position) goes at point,
7961 and moves point forward past the text. This is what is going on when
7962 you type text into Emacs.
7963 @item
7964 Buffers have lots of extra properties associated with them.
7965 @item
7966 Buffers can be @dfn{displayed}. What this means is that there
7967 exist a number of @dfn{windows}, which are objects that correspond
7968 to some visible section of your display, and each window has
7969 an associated buffer, and the current contents of the buffer
7970 are shown in that section of the display. The redisplay mechanism
7971 (which takes care of doing this) knows how to look at the
7972 text of a buffer and come up with some reasonable way of displaying
7973 this. Many of the properties of a buffer control how the
7974 buffer's text is displayed.
7975 @item
7976 One buffer is distinguished and called the @dfn{current buffer}. It is
7977 stored in the variable @code{current_buffer}. Buffer operations operate
7978 on this buffer by default. When you are typing text into a buffer, the
7979 buffer you are typing into is always @code{current_buffer}. Switching
7980 to a different window changes the current buffer. Note that Lisp code
7981 can temporarily change the current buffer using @code{set-buffer} (often
7982 enclosed in a @code{save-excursion} so that the former current buffer
7983 gets restored when the code is finished). However, calling
7984 @code{set-buffer} will NOT cause a permanent change in the current
7985 buffer. The reason for this is that the top-level event loop sets
7986 @code{current_buffer} to the buffer of the selected window, each time
7987 it finishes executing a user command.
7988 @end enumerate
7989
7990 Make sure you understand the distinction between @dfn{current buffer}
7991 and @dfn{buffer of the selected window}, and the distinction between
7992 @dfn{point} of the current buffer and @dfn{window-point} of the selected
7993 window. (This latter distinction is explained in detail in the section
7994 on windows.)
7995
7996 @node Buffer Lists, Markers and Extents, Introduction to Buffers, Buffers
7997 @section Buffer Lists
7998 @cindex buffer lists
7999
8000 Recall that buffers are @dfn{permanent} objects, i.e. that
8001 they remain around until explicitly deleted. This entails that there is
8002 a list of all the buffers in existence. This list is actually an
8003 assoc-list (mapping from the buffer's name to the buffer) and is stored
8004 in the global variable @code{Vbuffer_alist}.
8005
8006 The order of the buffers in the list is important: the buffers are
8007 ordered approximately from most-recently-used to least-recently-used.
8008 Switching to a buffer using @code{switch-to-buffer},
8009 @code{pop-to-buffer}, etc. and switching windows using
8010 @code{other-window}, etc. usually brings the new current buffer to the
8011 front of the list. @code{switch-to-buffer}, @code{other-buffer},
8012 etc. look at the beginning of the list to find an alternative buffer to
8013 suggest. You can also explicitly move a buffer to the end of the list
8014 using @code{bury-buffer}.
8015
8016 In addition to the global ordering in @code{Vbuffer_alist}, each frame
8017 has its own ordering of the list. These lists always contain the same
8018 elements as in @code{Vbuffer_alist} although possibly in a different
8019 order. @code{buffer-list} normally returns the list for the selected
8020 frame. This allows you to work in separate frames without things
8021 interfering with each other.
8022
8023 The standard way to look up a buffer given a name is
8024 @code{get-buffer}, and the standard way to create a new buffer is
8025 @code{get-buffer-create}, which looks up a buffer with a given name,
8026 creating a new one if necessary. These operations correspond exactly
8027 with the symbol operations @code{intern-soft} and @code{intern},
8028 respectively. You can also force a new buffer to be created using
8029 @code{generate-new-buffer}, which takes a name and (if necessary) makes
8030 a unique name from this by appending a number, and then creates the
8031 buffer. This is basically like the symbol operation @code{gensym}.
8032
8033 @node Markers and Extents, The Buffer Object, Buffer Lists, Buffers
8034 @section Markers and Extents
8035 @cindex markers and extents
8036 @cindex extents, markers and
8037
8038 Among the things associated with a buffer are things that are
8039 logically attached to certain buffer positions. This can be used to
8040 keep track of a buffer position when text is inserted and deleted, so
8041 that it remains at the same spot relative to the text around it; to
8042 assign properties to particular sections of text; etc. There are two
8043 such objects that are useful in this regard: they are @dfn{markers} and
8044 @dfn{extents}.
8045
8046 A @dfn{marker} is simply a flag placed at a particular buffer
8047 position, which is moved around as text is inserted and deleted.
8048 Markers are used for all sorts of purposes, such as the @code{mark} that
8049 is the other end of textual regions to be cut, copied, etc.
8050
8051 An @dfn{extent} is similar to two markers plus some associated
8052 properties, and is used to keep track of regions in a buffer as text is
8053 inserted and deleted, and to add properties (e.g. fonts) to particular
8054 regions of text. The external interface of extents is explained
8055 elsewhere.
8056
8057 The important thing here is that markers and extents simply contain
8058 buffer positions in them as integers, and every time text is inserted or
8059 deleted, these positions must be updated. In order to minimize the
8060 amount of shuffling that needs to be done, the positions in markers and
8061 extents (there's one per marker, two per extent) are stored in Membpos's.
8062 This means that they only need to be moved when the text is physically
8063 moved in memory; since the gap structure tries to minimize this, it also
8064 minimizes the number of marker and extent indices that need to be
8065 adjusted. Look in @file{insdel.c} for the details of how this works.
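
The benefit of storing memory indices can be shown with a toy gap buffer.
This is a sketch under assumptions, not the code in @file{insdel.c}:
positions are zero-based byte offsets, only the gap bookkeeping is
modeled (no text is stored), and all @code{toy_} names are hypothetical.
A memory index counts the gap as if it were text; a buffer position does
not.

```c
#include <assert.h>

struct toy_gap_buffer
{
  int gap_start;     /* memory index where the gap begins */
  int gap_size;      /* number of unused bytes in the gap */
};

/* Convert a memory index to a buffer position.  Indices at or before
   the gap are unchanged; indices after it have the gap subtracted.  */
int
toy_mem_to_pos (const struct toy_gap_buffer *b, int mem)
{
  return mem <= b->gap_start ? mem : mem - b->gap_size;
}

/* Insert N bytes at the gap.  The gap shrinks from the front, and no
   text outside the gap moves in memory.  */
void
toy_insert_at_gap (struct toy_gap_buffer *b, int n)
{
  assert (n >= 0 && n <= b->gap_size);
  b->gap_start += n;
  b->gap_size -= n;
}

/* Usage: a marker after the gap follows its text without being
   updated.  Returns the position before and after the insertion,
   packed as BEFORE * 100 + AFTER.  */
int
toy_marker_demo (void)
{
  struct toy_gap_buffer b = { 5, 10 };  /* gap occupies indices 5..14 */
  int marker_mem = 17;                  /* marker in text after the gap */
  int before = toy_mem_to_pos (&b, marker_mem);   /* 17 - 10 = 7 */

  toy_insert_at_gap (&b, 3);
  /* marker_mem is untouched, yet its buffer position advanced by 3.  */
  return before * 100 + toy_mem_to_pos (&b, marker_mem);   /* 7, 10 */
}
```

Only when the gap itself is moved does text change its memory index, and
only then do the stored indices between the old and new gap locations
need adjusting.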
8066
8067 One other important distinction is that markers are @dfn{temporary}
8068 while extents are @dfn{permanent}. This means that markers disappear as
8069 soon as there are no more pointers to them, and correspondingly, there
8070 is no way to determine what markers are in a buffer if you are just
8071 given the buffer. Extents remain in a buffer until they are detached
8072 (which could happen as a result of text being deleted) or the buffer is
8073 deleted, and primitives do exist to enumerate the extents in a buffer.
8074
8075 @node The Buffer Object, , Markers and Extents, Buffers
8076 @section The Buffer Object
8077 @cindex buffer object, the
8078 @cindex object, the buffer
8079
8080 Buffers contain fields not directly accessible by the Lisp programmer.
8081 We describe them here, naming them by the names used in the C code.
8082 Many are accessible indirectly in Lisp programs via Lisp primitives.
8083
8084 @table @code
8085 @item name
8086 The buffer name is a string that names the buffer. It is guaranteed to
8087 be unique. @xref{Buffer Names,,, lispref, XEmacs Lisp Reference
8088 Manual}.
8089
8090 @item save_modified
8091 This field contains the time when the buffer was last saved, as an
8092 integer. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference
8093 Manual}.
8094
8095 @item modtime
8096 This field contains the modification time of the visited file. It is
8097 set when the file is written or read. Every time the buffer is written
8098 to the file, this field is compared to the modification time of the
8099 file. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference
8100 Manual}.
8101
8102 @item auto_save_modified
8103 This field contains the time when the buffer was last auto-saved.
8104
8105 @item last_window_start
8106 This field contains the @code{window-start} position in the buffer as of
8107 the last time the buffer was displayed in a window.
8108
8109 @item undo_list
8110 This field points to the buffer's undo list. @xref{Undo,,, lispref,
8111 XEmacs Lisp Reference Manual}.
8112
8113 @item syntax_table_v
8114 This field contains the syntax table for the buffer. @xref{Syntax
8115 Tables,,, lispref, XEmacs Lisp Reference Manual}.
8116
8117 @item downcase_table
8118 This field contains the conversion table for converting text to lower
8119 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}.
8120
8121 @item upcase_table
8122 This field contains the conversion table for converting text to upper
8123 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}.
8124
8125 @item case_canon_table
8126 This field contains the conversion table for canonicalizing text for
8127 case-folding search. @xref{Case Tables,,, lispref, XEmacs Lisp
8128 Reference Manual}.
8129
8130 @item case_eqv_table
8131 This field contains the equivalence table for case-folding search.
8132 @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}.
8133
8134 @item display_table
8135 This field contains the buffer's display table, or @code{nil} if it
8136 doesn't have one. @xref{Display Tables,,, lispref, XEmacs Lisp
8137 Reference Manual}.
8138
8139 @item markers
8140 This field contains the chain of all markers that currently point into
8141 the buffer. Deletion of text in the buffer, and motion of the buffer's
8142 gap, must check each of these markers and perhaps update it.
8143 @xref{Markers,,, lispref, XEmacs Lisp Reference Manual}.
8144
8145 @item backed_up
8146 This field is a flag that tells whether a backup file has been made for
8147 the visited file of this buffer.
8148
8149 @item mark
8150 This field contains the mark for the buffer. The mark is a marker,
8151 hence it is also included on the list @code{markers}. @xref{The Mark,,,
8152 lispref, XEmacs Lisp Reference Manual}.
8153
8154 @item mark_active
8155 This field is non-@code{nil} if the buffer's mark is active.
8156
8157 @item local_var_alist
8158 This field contains the association list describing the variables local
8159 in this buffer, and their values, with the exception of local variables
8160 that have special slots in the buffer object. (Those slots are omitted
8161 from this table.) @xref{Buffer-Local Variables,,, lispref, XEmacs Lisp
8162 Reference Manual}.
8163
8164 @item modeline_format
8165 This field contains a Lisp object which controls how to display the mode
8166 line for this buffer. @xref{Modeline Format,,, lispref, XEmacs Lisp
8167 Reference Manual}.
8168
8169 @item base_buffer
8170 This field holds the buffer's base buffer (if it is an indirect buffer),
8171 or @code{nil}.
8172 @end table
8173
8174 @node Text, Multilingual Support, Buffers, Top
8175 @chapter Text
8176 @cindex text
8177
8178 @menu
8179 * The Text in a Buffer:: Representation of the text in a buffer.
8180 * Ibytes and Ichars:: Representation of individual characters.
8181 * Byte-Char Position Conversion::
8182 * Searching and Matching:: Higher-level algorithms.
8183 @end menu
8184
8185 @node The Text in a Buffer, Ibytes and Ichars, Text, Text
8186 @section The Text in a Buffer
8187 @cindex text in a buffer, the
8188 @cindex buffer, the text in a
8189
8190 The text in a buffer consists of a sequence of zero or more
8191 characters. A @dfn{character} is an integer that logically represents
8192 a letter, number, space, or other unit of text. Most of the characters
8193 that you will typically encounter belong to the ASCII set of characters,
8194 but there are also characters for various sorts of accented letters,
8195 special symbols, Chinese and Japanese characters (e.g. Kanji, Katakana,
8196 etc.), Cyrillic and Greek letters, etc. The actual number of possible
8197 characters is quite large.
8198
8199 For now, we can view a character as some non-negative integer that
8200 has some shape that defines how it typically appears (e.g. as an
8201 uppercase A). (The exact way in which a character appears depends on the
8202 font used to display the character.) The internal type of characters in
8203 the C code is an @code{Ichar}; this is just an @code{int}, but using a
8204 symbolic type makes the code clearer.
8205
8206 Between every character in a buffer is a @dfn{buffer position} or
8207 @dfn{character position}. We can speak of the character before or after
8208 a particular buffer position, and when you insert a character at a
8209 particular position, all characters after that position end up at new
8210 positions. When we speak of the character @dfn{at} a position, we
8211 really mean the character after the position. (This schizophrenia
8212 between a buffer position being ``between'' two characters and ``on'' a
8213 character is rampant in Emacs.)
8214
8215 Buffer positions are numbered starting at 1. This means that
8216 position 1 is before the first character, and position 0 is not
8217 valid. If there are N characters in a buffer, then buffer
8218 position N+1 is after the last one, and position N+2 is not valid.
8219
8220 The internal makeup of the Ichar integer varies depending on whether
8221 we have compiled with MULE support. If not, the Ichar integer is an
8222 8-bit integer with possible values from 0 - 255. 0 - 127 are the
8223 standard ASCII characters, while 128 - 255 are the characters from the
8224 ISO-8859-1 character set. If we have compiled with MULE support, an
8225 Ichar is a 19-bit integer, with the various bits having meanings
8226 according to a complex scheme that will be detailed later. The
8227 characters numbered 0 - 255 still have the same meanings as for the
8228 non-MULE case, though.
8229
8230 Internally, the text in a buffer is represented in a fairly simple
8231 fashion: as a contiguous array of bytes, with a @dfn{gap} of some size
8232 in the middle. Although the gap is of some substantial size in bytes,
8233 there is no text contained within it: From the perspective of the text
8234 in the buffer, it does not exist. The gap logically sits at some buffer
8235 position, between two characters (or possibly at the beginning or end of
8236 the buffer). Insertion of text in a buffer at a particular position is
8237 always accomplished by first moving the gap to that position
8238 (i.e. through some block moving of text), then writing the text into the
8239 beginning of the gap, thereby shrinking the gap. If the gap shrinks
8240 down to nothing, a new gap is created. (What actually happens is that a
8241 new gap is ``created'' at the end of the buffer's text, which requires
8242 nothing more than changing a couple of indices; then the gap is
8243 ``moved'' to the position where the insertion needs to take place by
8244 moving up in memory all the text after that position.) Similarly,
8245 deletion occurs by moving the gap to the place where the text is to be
8246 deleted, and then simply expanding the gap to include the deleted text.
8247 (@dfn{Expanding} and @dfn{shrinking} the gap as just described means
8248 just that the internal indices that keep track of where the gap is
8249 located are changed.)
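
The gap mechanics described above can be sketched as a toy gap buffer. This is only an illustration: the structure, its field names, and the functions @code{move_gap}, @code{gap_insert}, @code{gap_delete} and @code{gap_byte_at} are hypothetical and are not the ones used in the XEmacs sources. Offsets here are 0-based for brevity, and the amount of slack added when the gap is recreated (64 bytes) is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy gap buffer; names are illustrative only. */
struct gap_buffer
{
  unsigned char *mem;   /* text before the gap, the gap, text after it */
  size_t gap_start;     /* 0-based memory offset of the first gap byte */
  size_t gap_size;      /* number of unused bytes in the gap */
  size_t text_len;      /* number of bytes of real text */
};

/* Move the gap so that it begins at 0-based text offset POS. */
static void
move_gap (struct gap_buffer *gb, size_t pos)
{
  assert (pos <= gb->text_len);
  if (pos < gb->gap_start)
    /* Shift the text between POS and the gap to the far side of the gap. */
    memmove (gb->mem + pos + gb->gap_size, gb->mem + pos,
             gb->gap_start - pos);
  else if (pos > gb->gap_start)
    /* Shift the text just after the gap down to where the gap began. */
    memmove (gb->mem + gb->gap_start,
             gb->mem + gb->gap_start + gb->gap_size,
             pos - gb->gap_start);
  gb->gap_start = pos;
}

/* Insert LEN bytes at text offset POS.  Returns 0, or -1 if out of memory. */
static int
gap_insert (struct gap_buffer *gb, size_t pos,
            const unsigned char *src, size_t len)
{
  if (len > gb->gap_size)
    {
      /* "Create" a new gap at the end of the text, then grow it. */
      size_t new_gap = len + 64;
      unsigned char *mem;
      move_gap (gb, gb->text_len);
      mem = realloc (gb->mem, gb->text_len + new_gap);
      if (!mem)
        return -1;
      gb->mem = mem;
      gb->gap_size = new_gap;
    }
  move_gap (gb, pos);
  memcpy (gb->mem + gb->gap_start, src, len);   /* write into gap start */
  gb->gap_start += len;                         /* ... shrinking the gap */
  gb->gap_size -= len;
  gb->text_len += len;
  return 0;
}

/* Delete LEN bytes starting at text offset POS. */
static void
gap_delete (struct gap_buffer *gb, size_t pos, size_t len)
{
  assert (pos + len <= gb->text_len);
  move_gap (gb, pos);
  gb->gap_size += len;   /* the deleted bytes simply join the gap */
  gb->text_len -= len;
}

/* Fetch the byte at text offset POS, skipping over the gap. */
static unsigned char
gap_byte_at (const struct gap_buffer *gb, size_t pos)
{
  return gb->mem[pos < gb->gap_start ? pos : pos + gb->gap_size];
}
```

Note that @code{gap_delete} never gives memory back, which matches the behavior described in the following paragraph.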
8250
8251 Note that the total amount of memory allocated for a buffer's text never
8252 decreases while the buffer is live. Therefore, if you load up a
8253 20-megabyte file and then delete all but one character, there will be a
8254 20-megabyte gap, which won't get any smaller (except by inserting
8255 characters back again). Once the buffer is killed, the memory allocated
8256 for the buffer text will be freed, but it will still be sitting on the
8257 heap, taking up virtual memory, and will not be released back to the
8258 operating system. (However, if you have compiled XEmacs with rel-alloc,
8259 the situation is different. In this case, the space @emph{will} be
8260 released back to the operating system. However, this tends to result in a
8261 noticeable speed penalty.)
8262
8263 Astute readers may notice that the text in a buffer is represented as
8264 an array of @emph{bytes}, while (at least in the MULE case) an Ichar is
8265 a 19-bit integer, which clearly cannot fit in a byte. This means (of
8266 course) that the text in a buffer uses a different representation from
8267 an Ichar: specifically, the 19-bit Ichar becomes a series of one to
8268 four bytes. The conversion between these two representations is complex
8269 and will be described later.
8270
8271 In the non-MULE case, everything is very simple: An Ichar
8272 is an 8-bit value, which fits neatly into one byte.
8273
8274 If we are given a buffer position and want to retrieve the
8275 character at that position, we need to follow these steps:
8276
8277 @enumerate
8278 @item
8279 Pretend there's no gap, and convert the buffer position into a @dfn{byte
8280 index} that indexes to the appropriate byte in the buffer's stream of
8281 textual bytes. By convention, byte indices begin at 1, just like buffer
8282 positions. In the non-MULE case, byte indices and buffer positions are
8283 identical, since one character equals one byte.
8284 @item
8285 Convert the byte index into a @dfn{memory index}, which takes the gap
8286 into account. The memory index is a direct index into the block of
8287 memory that stores the text of a buffer. This basically just involves
8288 checking to see if the byte index is past the gap, and if so, adding the
8289 size of the gap to it. By convention, memory indices begin at 1, just
8290 like buffer positions and byte indices, and when referring to the
8291 position that is @dfn{at} the gap, we always use the memory position at
8292 the @emph{beginning}, not at the end, of the gap.
8293 @item
8294 Fetch the appropriate bytes at the determined memory position.
8295 @item
8296 Convert these bytes into an Ichar.
8297 @end enumerate
8298
8299 In the non-MULE case, steps (3) and (4) boil down to a simple one-byte
8300 memory access.
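
As a rough sketch of step (2), the conversion from a byte index to a memory index might look like the following. The typedefs mirror the names used in this section, but the function names and the @code{gap_pos} and @code{gap_size} parameters are assumptions made for this example, not the macros in the sources. The sketch also assumes that fetching the byte @emph{after} a position sitting exactly at the gap must look past the gap, even though the position itself is reported at the beginning of the gap.

```c
#include <assert.h>

typedef int Bytebpos;   /* 1-based byte index; pretends the gap is absent */
typedef int Membpos;    /* 1-based memory index; includes the gap */

/* Convert a byte index to a memory index.  GAP_POS is the byte index at
   which the gap sits and GAP_SIZE its size in bytes (assumed names).
   A position exactly at the gap maps to the beginning of the gap. */
static Membpos
bytebpos_to_membpos (Bytebpos pos, Bytebpos gap_pos, int gap_size)
{
  return pos > gap_pos ? pos + gap_size : pos;
}

/* Fetch the byte after position POS.  Here a position exactly at the gap
   must skip over it, hence ">=" rather than ">". */
static unsigned char
fetch_byte (const unsigned char *mem, Bytebpos pos,
            Bytebpos gap_pos, int gap_size)
{
  Membpos m = pos >= gap_pos ? pos + gap_size : pos;
  return mem[m - 1];   /* memory indices are 1-based */
}
```

In the non-MULE case the value returned by @code{fetch_byte} is already the character, so steps (3) and (4) need nothing more.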
8301
8302 Note that we have defined three types of positions in a buffer:
8303
8304 @enumerate
8305 @item
8306 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos}
8307 @item
8308 @dfn{byte indices}, typedef @code{Bytebpos}
8309 @item
8310 @dfn{memory indices}, typedef @code{Membpos}
8311 @end enumerate
8312
8313 All three typedefs are just @code{int}s, but defining them this way makes
8314 things a lot clearer.
8315
8316 Most code works with buffer positions. In particular, all Lisp code
8317 that refers to text in a buffer uses buffer positions. Lisp code does
8318 not know that byte indices or memory indices exist.
8319
8320 Finally, we have a typedef for the bytes in a buffer. This is an
8321 @code{Ibyte}, which is an @code{unsigned char}. Referring to them as
8322 Ibytes underscores the fact that we are working with a string of bytes
8323 in the internal Emacs buffer representation rather than in one of a
8324 number of possible alternative representations (e.g. EUC-encoded text,
8325 etc.).
8326
8327 @node Ibytes and Ichars, Byte-Char Position Conversion, The Text in a Buffer, Text
8328 @section Ibytes and Ichars
8329 @cindex Ibytes and Ichars
8330 @cindex Ichars, Ibytes and
8331
8332 Not yet documented.
8333
8334 @node Byte-Char Position Conversion, Searching and Matching, Ibytes and Ichars, Text
8335 @section Byte-Char Position Conversion
8336 @cindex byte-char position conversion
8337 @cindex position conversion, byte-char
8338 @cindex conversion, byte-char position
8339
8340 Oct 2004:
8341
8342 This is what I wrote when describing the previous algorithm:
8343
8344 @quotation
8345 The basic algorithm we use is to keep track of a known region of
8346 characters in each buffer, all of which are of the same width. We keep
8347 track of the boundaries of the region in both Charbpos and Bytebpos
8348 coordinates and also keep track of the char width, which is 1 - 4 bytes.
8349 If the position we're translating is not in the known region, then we
8350 invoke a function to update the known region to surround the position in
8351 question. This assumes locality of reference, which is usually the
8352 case.
8353
8354 Note that the function to update the known region can be simple or
8355 complicated depending on how much information we cache. In addition to
8356 the known region, we always cache the correct conversions for point,
8357 BEGV, and ZV, and in addition to this we cache 16 positions where the
8358 conversion is known. We only look in the cache or update it when we
8359 need to move the known region more than a certain amount (currently 50
8360 chars), and then we throw away a "random" value and replace it with the
8361 newly calculated value.
8362
8363 Finally, we maintain an extra flag that tracks whether the buffer is
8364 entirely ASCII, to speed up the conversions even more. This flag is
8365 actually of dubious value because in an entirely-ASCII buffer the known
8366 region will always span the entire buffer (in fact, we update the flag
8367 based on this fact), and so all we're saving is a few machine cycles.
8368
8369 A potentially smarter method than what we do with known regions and
8370 cached positions would be to keep some sort of pseudo-extent layer over
8371 the buffer; maybe keep track of the charbpos/bytebpos correspondence at
8372 the beginning of each line, which would allow us to do a binary search
8373 over the pseudo-extents to narrow things down to the correct line, at
8374 which point you could use a linear movement method. This would also
8375 mesh well with efficiently implementing a line-numbering scheme.
8376 However, you have to weigh the amount of time spent updating the cache
8377 vs. the savings that result from it. In reality, we modify the buffer
8378 far less often than we access it, so a cache of this sort that provides
8379 guaranteed LOG (N) performance (or perhaps N * LOG (N), if we set a
8380 maximum on the cache size) would indeed be a win, particularly in very
8381 large buffers. If we ever implement this, we should probably set a
8382 reasonably high minimum below which we use the old method, because the
8383 time spent updating the fancy cache would likely become dominant when
8384 making buffer modifications in smaller buffers.
8385
8386 Note also that we have to multiply or divide by the char width in order
8387 to convert the positions. We do some tricks to avoid ever actually
8388 having to do a multiply or divide, because that is typically an
8389 expensive operation (esp. divide). Multiplying or dividing by 1, 2, or
8390 4 can be implemented simply as a shift left or shift right, and we keep
8391 track of a shifter value (0, 1, or 2) indicating how much to shift.
8392 Multiplying by 3 can be implemented by doubling and then adding the
8393 original value. Dividing by 3, alas, cannot be implemented in any
8394 simple shift/subtract method, as far as I know; so we just do a table
8395 lookup. For simplicity, we use a table of size 128K, which indexes the
8396 "divide-by-3" values for the first 64K non-negative numbers. (Note that
8397 we can increase the size up to 384K, i.e. indexing the first 192K
8398 non-negative numbers, while still using shorts in the array.) This also
8399 means that the size of the known region can be at most 64K for
8400 width-three characters.
8401 @end quotation
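
The multiply and divide tricks mentioned in the quotation can be illustrated as follows. This is a sketch of the idea only, not the old implementation: the function and table names are invented here, and the sketch switches on the width directly instead of keeping a separate shifter value.

```c
#include <assert.h>

/* 64K entries of 2 bytes each gives the 128K table described above. */
#define DIV3_TABLE_ENTRIES 65536
static unsigned short div3_table[DIV3_TABLE_ENTRIES];

/* Fill in the divide-by-3 lookup table; call once before use. */
static void
init_div3_table (void)
{
  int i;
  for (i = 0; i < DIV3_TABLE_ENTRIES; i++)
    div3_table[i] = (unsigned short) (i / 3);
}

/* Characters to bytes inside a region of uniform WIDTH (1 to 4). */
static int
chars_to_bytes (int nchars, int width)
{
  switch (width)
    {
    case 1: return nchars;
    case 2: return nchars << 1;
    case 3: return (nchars << 1) + nchars;   /* double, then add */
    default: return nchars << 2;             /* width 4 */
    }
}

/* Bytes to characters inside a region of uniform WIDTH (1 to 4). */
static int
bytes_to_chars (int nbytes, int width)
{
  switch (width)
    {
    case 1: return nbytes;
    case 2: return nbytes >> 1;
    case 3:
      /* The table limits how large a width-three region can be. */
      assert (nbytes >= 0 && nbytes < DIV3_TABLE_ENTRIES);
      return div3_table[nbytes];
    default: return nbytes >> 2;             /* width 4 */
    }
}
```

On current hardware a compiler will usually turn division by a constant into a multiply and shift, so the table is mainly of historical interest.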
8402
8403 Unfortunately, it turned out that the implementation had serious problems
8404 which had never been corrected. In particular, the known region had a
8405 large tendency to become zero-length and stay that way.
8406
8407 So I decided to port the algorithm from FSF 21.3, in markers.c.
8408
8409 This algorithm is fairly simple. Instead of using markers I kept the cache
8410 array of known positions from the previous implementation.
8411
8412 Basically, we keep a number of positions cached:
8413
8414 @itemize @bullet
8415 @item
8416 the actual end of the buffer
8417 @item
8418 the beginning and end of the accessible region
8419 @item
8420 the value of point
8421 @item
8422 the position of the gap
8423 @item
8424 the last value we computed
8425 @item
8426 a set of positions that are "far away" from previously computed positions
8427 (5000 chars currently; #### perhaps should be smaller)
8428 @end itemize
8429
8430 For each position, we @code{CONSIDER()} it. This means:
8431
8432 @itemize @bullet
8433 @item
8434 If the position is what we're looking for, return it directly.
8435 @item
8436 Starting with the beginning and end of the buffer, we successively
8437 compute the smallest enclosing range of known positions. If at any
8438 point we discover that this range has the same byte and char length
8439 (i.e. is entirely single-byte), then our computation is trivial.
8440 @item
8441 If at any point we get a small enough range (50 chars currently),
8442 stop considering further positions.
8443 @end itemize
8444
8445 Otherwise, once we have an enclosing range, see which side is closer, and
8446 iterate until we find the desired value. As an optimization, I replaced
8447 the simple loop in FSF with the use of @code{bytecount_to_charcount()},
8448 @code{charcount_to_bytecount()}, @code{bytecount_to_charcount_down()}, or
8449 @code{charcount_to_bytecount_down()}. (The latter two I added for this purpose.)
8450 These scan 4 or 8 bytes at a time through purely single-byte characters.
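
The word-at-a-time scan can be sketched as below. This is not the code of @code{bytecount_to_charcount()}. The function name @code{count_chars} is made up, and the sketch assumes an encoding in which bytes below 0x80 are single-byte characters and bytes at or above 0xA0 continue a multi-byte character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed encoding for this sketch: bytes >= 0xA0 continue a character,
   anything else starts one. */
static int
is_continuation_byte (unsigned char c)
{
  return c >= 0xA0;
}

/* Count the characters in the NBYTES bytes starting at P.  The range is
   assumed to begin and end on character boundaries. */
static size_t
count_chars (const unsigned char *p, size_t nbytes)
{
  const unsigned char *end = p + nbytes;
  /* A word with the high bit of every byte set: 0x8080...80. */
  const unsigned long high_bits = ~0UL / 0xFF * 0x80;
  size_t nchars = 0;

  while (p < end)
    {
      if ((size_t) (end - p) >= sizeof (unsigned long))
        {
          unsigned long word;
          memcpy (&word, p, sizeof word);   /* safe for unaligned P */
          if ((word & high_bits) == 0)
            {
              /* 4 or 8 single-byte characters in one step. */
              nchars += sizeof word;
              p += sizeof word;
              continue;
            }
        }
      /* Slow path: one byte at a time, counting character starts. */
      if (!is_continuation_byte (*p))
        nchars++;
      p++;
    }
  return nchars;
}
```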
8451
8452 If the amount we had to scan was more than our "far away" distance (5000
8453 characters, see above), then cache the new position.
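
Putting the pieces together, the overall shape of the conversion might look like the sketch below. It is a simplification under stated assumptions, not the code in the sources: all names are hypothetical, offsets are 0-based, continuation bytes are assumed to be those at or above 0xA0, the cache is a plain unsorted array, the scan goes one character at a time, and storing newly computed positions back into the cache is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct known_pos
{
  size_t chars;   /* character offset from the start of the text */
  size_t bytes;   /* corresponding byte offset */
};

/* Assumed encoding for this sketch: bytes >= 0xA0 continue a character. */
static int
is_continuation_byte (unsigned char c)
{
  return c >= 0xA0;
}

/* Tighten the range [*below, *above] around TARGET using known pair K. */
static void
consider (struct known_pos k, size_t target,
          struct known_pos *below, struct known_pos *above)
{
  if (k.chars <= target && k.chars > below->chars)
    *below = k;
  if (k.chars >= target && k.chars < above->chars)
    *above = k;
}

/* Convert character offset TARGET to a byte offset in well-formed TEXT. */
static size_t
char_to_byte (const unsigned char *text, size_t nchars, size_t nbytes,
              const struct known_pos *cache, size_t ncache, size_t target)
{
  struct known_pos below = { 0, 0 };
  struct known_pos above = { nchars, nbytes };
  size_t i, c, b;

  assert (target <= nchars);
  for (i = 0; i < ncache; i++)
    {
      consider (cache[i], target, &below, &above);
      if (above.chars - below.chars <= 50)
        break;   /* small enough range; stop considering positions */
    }

  /* Same length in chars and bytes: the range is entirely single-byte.
     This also covers the case where a known position equals TARGET. */
  if (above.chars - below.chars == above.bytes - below.bytes)
    return below.bytes + (target - below.chars);

  if (target - below.chars <= above.chars - target)
    {
      /* The lower bound is closer: scan forward. */
      c = below.chars, b = below.bytes;
      while (c < target)
        {
          b++;
          while (b < nbytes && is_continuation_byte (text[b]))
            b++;
          c++;
        }
    }
  else
    {
      /* The upper bound is closer: scan backward. */
      c = above.chars, b = above.bytes;
      while (c > target)
        {
          b--;
          while (is_continuation_byte (text[b]))
            b--;
          c--;
        }
    }
  return b;
}
```

A fuller version would record the result in the cache whenever the scan covered more than the ``far away'' distance.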
8454
8455 #### Things to do:
8456
8457 @itemize @bullet
8458 @item
8459 Look at the most recent GNU Emacs to see whether anything has changed.
8460 @item
8461 Think about whether it makes sense to try to implement some sort of
8462 known region or list of "known regions", like we had before. This would
8463 be a region of entirely single-byte characters that we can check very
8464 quickly. (Previously I used a range of same-width characters of any
8465 size; but this adds extra complexity and slows down the scanning, and is
8466 probably not worth it.) As part of the scanning process in
8467 @code{bytecount_to_charcount()} et al, we skip over chunks of entirely
8468 single-byte chars, so it should be easy to remember the last one.
8469 Presumably what we should do is keep track of the largest known surrounding
8470 entirely-single-byte region for each of the cache positions as well as
8471 perhaps the last-cached position. We want to be careful not to get bitten
8472 by the previous problem of having the known region getting reset too
8473 often. If we implement this, we might well want to continue scanning
8474 some distance past the desired position (maybe 300-1000 bytes) if we are
8475 in a single-byte range so that we won't end up expanding the known range
8476 one position at a time and entering the function each time.
8477 @item
8478 Think about whether it makes sense to keep the position cache sorted.
8479 This would allow it to be larger and finer-grained in its positions.
8480 Note that with FSF's use of markers, they were sorted, but this
8481 was not really made good use of. With an array, we can do binary searching
8482 to quickly find the smallest range. We would probably want to make use of
8483 the gap-array code in extents.c.
8484 @end itemize
8485
8486 Note that FSF's algorithm checked @strong{ALL} markers, not just the ones cached
8487 by this algorithm. This includes markers created by the user as well as
8488 both ends of any overlays. We could do similarly, and our extents could
8489 keep both byte and character positions rather than just the former. (But
8490 this would probably be overkill. We should just use our cache instead.
8491 Any place an extent was set was surely already visited by the char<-->byte
8492 conversion routines.)
8493
8494 @node Searching and Matching, , Byte-Char Position Conversion, Text
8495 @section Searching and Matching
8496 @cindex searching
8497 @cindex matching
8498
8499 Very incomplete, limited to a brief introduction.
8500
8501 People find the searching and matching code difficult to understand.
8502 And indeed, the details are hard. However, the basic structures are not
8503 so complex. First, there's a hard question with a simple answer. What
8504 about Mule? The answer here is that it turns out that Mule characters
8505 can be matched byte by byte, so neither the search code nor the regular
8506 expression code need take much notice of it at all! Of course, we add
8507 some special features (such as regular expressions that match only
8508 certain charsets), but these do not require new concepts. The main
8509 exception is that wild-card matches in Mule have to be careful to
8510 swallow whole characters. This is handled using the same basic macros
8511 that are used for buffer and string movements.
8512
8513 This will also be true if a UTF-8 representation is used for the
8514 internal encoding.
8515
8516 The complex algorithms for searching are for simple string searches. In
8517 particular, the algorithm used for fast string searching is Boyer-Moore.
8518 This algorithm is based on the idea that if you have a mismatch at a
8519 given position, you can precompute where to restart the search. This
8520 means that you can often make far fewer than N character
8521 comparisons, where N is the position at which the match is found, or the
8522 size of the text if it contains no match. That's fast! But it's not
8523 easy. You must ``compile'' the search string into a jump table. See
8524 the source, @file{search.c}, for more information.
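
To give a feel for the jump table, here is a sketch of the Horspool simplification of Boyer-Moore, which keeps only a single ``bad character'' shift table. It is not the algorithm as implemented in @file{search.c}, which is more elaborate and also handles case translation; the function name is invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first occurrence of PAT (M bytes) in TEXT
   (N bytes), or -1 if there is none.  Boyer-Moore-Horspool variant. */
static long
horspool_search (const unsigned char *text, size_t n,
                 const unsigned char *pat, size_t m)
{
  size_t shift[UCHAR_MAX + 1];
  size_t i, pos;

  if (m == 0)
    return 0;
  if (m > n)
    return -1;

  /* "Compile" the pattern: for each byte value, how far the window may
     slide when that byte is aligned with the last pattern position. */
  for (i = 0; i <= UCHAR_MAX; i++)
    shift[i] = m;
  for (i = 0; i + 1 < m; i++)
    shift[pat[i]] = m - 1 - i;

  pos = 0;
  while (pos <= n - m)
    {
      if (memcmp (text + pos, pat, m) == 0)
        return (long) pos;
      /* Slide by the shift for the text byte under the pattern's end. */
      pos += shift[text[pos + m - 1]];
    }
  return -1;
}
```

A byte that does not occur in the pattern lets the window slide by the full pattern length, which is where the savings come from.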
8525
8526 Emacs changes the basic algorithms somewhat in order to handle
8527 case-insensitive searches without a full-blown regular expression.
8528
8529 Regular expressions, on the other hand, have a trivial search
8530 implementation: try a match at each position. (Under POSIX rules, it's
8531 a bit more complex, because POSIX requires that you find the
8532 @emph{longest} match in the text. This means you keep a record of the
8533 best match so far, and find all the matches.)
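
The ``try a match at each position'' loop is short enough to sketch directly. Everything here is hypothetical: @code{match_fn} stands in for the compiled-regexp matcher, and @code{match_digits} is a hand-written matcher for one or more digits that takes the longest match at its starting position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matcher type: returns the length of the match that starts
   exactly at TEXT, or -1 if nothing matches there. */
typedef long (*match_fn) (const char *text, size_t len);

/* Try MATCH_AT at each position in turn.  Returns the offset of the
   first match and stores its length in *MATCH_LEN, or returns -1. */
static long
search (match_fn match_at, const char *text, size_t len, long *match_len)
{
  size_t start;
  for (start = 0; start <= len; start++)
    {
      long n = match_at (text + start, len - start);
      if (n >= 0)
        {
          *match_len = n;
          return (long) start;
        }
    }
  return -1;
}

/* Stand-in for a compiled regexp: one or more ASCII digits, greedy. */
static long
match_digits (const char *text, size_t len)
{
  size_t n = 0;
  while (n < len && text[n] >= '0' && text[n] <= '9')
    n++;
  return n > 0 ? (long) n : -1;
}
```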
8534
8535 The matching code for regular expressions is quite complex. First, the
8536 regular expression itself is compiled. There are two basic approaches
8537 that could be taken. The first is to compile the expression into tables
8538 to drive a generic finite automaton emulator. This is the approach
8539 given in many textbooks (Sedgewick's @emph{Algorithms} and Aho, Sethi,
8540 and Ullman's @emph{Compilers: Principles, Techniques, and Tools}, aka
8541 ``The Dragon Book'') as well as being used by the @file{lex} family of
8542 lexical analysis engines.
8543
8544 Emacs uses a somewhat different technique. The expression is compiled
8545 into a form of bytecode, which is interpreted by a special interpreter.
8546 The interpreter itself basically amounts to an inline implementation of
8547 the finite automaton emulator. The advantage of this technique is that
8548 it's easier to add special features, such as control of case-sensitivity
8549 via a global variable.
8550
8551 The compiler is not treated here. See the source, @file{regex.c}. The
8552 interpreter, although it is divided into several functions, and looks
8553 fearsomely complex, is actually quite simple in concept. After all,
8554 basically what you're doing there is a @code{strcmp} on steroids, right?
8555
8556 @example
8557 int
8558 strcmp (char *p, /* pattern pointer */
8559         char *b) /* buffer pointer */
8560 @{
8561   while (*p != '\0' && *p == *b) /* stop at the terminator or a mismatch */
8562     p++, b++;
8563   return (unsigned char) *p - (unsigned char) *b;
8564 @}
8565 @end example
8566
8567 Really, it's no harder than that. (A bit of a white lie, OK?)
8568
8569 How does the regexp code generalize this?
8570
8571 @enumerate
8572 @item
8573 Depending on the pattern, @code{*b} may have a general relationship to
8574 @code{*p}. @emph{I.e.}, direct comparison against @code{*p} is
8575 generalized to include checks for set membership, and context dependent
8576 properties. This depends on @code{&*b}. Of course that's meaningless
8577 in C, so we use @code{b} directly, instead.
8578
8579 @item
8580 Although to ensure the algorithm terminates, @code{b} must advance step
8581 by step, @code{p} can branch and jump.
8582
8583 @item
8584 The information returned is much greater, including information about
8585 subexpressions.
8586 @end enumerate
8587
8588 We'll ignore (3). (2) is mostly interesting when compiling the regular
8589 expression. Now we have
8590
8591 @example
8592 @group
8593 enum operator_t @{
8594 accept = 0,
8595 exact,
8596 any,
8597 range,
8598 group, /* actually, these are probably */
8599 repeat, /* turned into conditional code */
8600 /* etc */
8601 @};
8602 @end group
8603
8604 @group
8605 enum status_t @{
8606 working = 0,
8607 matched,
8608 mismatch,
8609 end_of_buffer,
8610 error
8611 @};
8612 @end group
8613
8614 @group
8615 struct pattern @{
8616 enum operator_t operator;
8617 char char_value;
8618 boolean range_table[256];
8619 /* etc, etc */
8620 @};
8621 @end group
8622
8623 @group
8624 char *p, /* pattern pointer */
8625 *b; /* buffer pointer */
8626
8627 enum status_t
8628 match (struct pattern *p, char *b)
8629 @{
8630 enum status_t done = working;
8631
8632 while (!(done = match_1_operator (p, b)))
8633 @{
8634 struct pattern *p1 = p;
8635 p = next_p (p, b);
8636 b = next_b (p1, b);
8637 @}
8638 return done;
8639 @}
8640 @end group
8641 @end example
8642
8643 This format exposes the underlying finite automaton.
8644
8645 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions all
8646 have the following structure, except that the @samp{next_*} functions decide
8647 where to jump (for @samp{p}) and whether or not to increment (for @samp{b}),
8648 rather than checking for satisfaction of a matching condition.
8649
8650 @example
8651 enum status_t
8652 match_1_operator (struct pattern *p, char *b)
8653 @{
8654   if (! *b) return end_of_buffer;
8655   switch (p->operator)
8656     @{
8657     case accept:
8658       return matched;
8659     case exact:
8660       if (*b != p->char_value) return mismatch; else break;
8661     case any:
8662       break;
8663     case range:
8664       /* range_table is computed in the regexp_compile function */
8665       if (! p->range_table[(unsigned char) *b]) return mismatch; else break;
8666     /* etc, etc */
8667     @}
8668   return working;
8669 @}
8670 @end example
8671
8672 Grouping, repetition, and alternation are handled by compiling the
8673 subexpression and calling @code{match (p->subpattern, b)} recursively.
8674
8675 In terms of reading the actual code, there are five optimizations
8676 (obfuscations, if you like) that have been done.
8677
8678 @enumerate
8679 @item
8680 An explicit "failure stack" has been substituted for recursion.
8681
8682 @item
8683 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions
8684 are actually inlined into the @code{match} function for efficiency.
8685 Then the pointer movement is interspersed with the matching operations.
8686
8687 @item
8688 If the operator uses buffer context, the buffer pointer movement is
8689 sometimes implicit in the operations retrieving the context.
8690
8691 @item
8692 Some cases are combined into short preparation for individual cases, and
8693 a "fall-through" into combined code for several cases.
8694
8695 @item
8696 The @code{pattern} type is not an explicit @samp{struct}. Instead, the
8697 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the
8698 compiled bytecode. This leads to bizarre code in the interpreter like
8699
8700 @example
8701 case range:
8702 p += *(p + 1); break;
8703 @end example
8704
8705 in @code{next_p}, because the compiled pattern is laid out
8706
8707 @example
8708 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ...
8709 @end example
8710 @end enumerate
8711
8712 But if you keep your eye on the "switch in a loop" structure, you
8713 should be able to understand the parts you need.
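
To make the inlined layout in optimization (5) concrete, the following sketch tests a character against a range whose flags are stored directly in the bytecode. The opcode values and function names are hypothetical. The sketch assumes that @code{count} is the number of flag bytes, that each flag byte covers eight consecutive character codes with the lowest code in bit 0, and that the next operator follows immediately after the flags.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_RANGE = 1, OP_ACCEPT = 2 };   /* hypothetical opcode values */

/* P points at a range operator laid out as
     OP_RANGE, count, flags[0], ..., flags[count-1], next_op, ...
   Return nonzero if character C is a member of the range. */
static int
range_matches (const unsigned char *p, unsigned char c)
{
  unsigned count = p[1];
  const unsigned char *flags = p + 2;

  /* Characters beyond the stored flags are not in the set. */
  if ((unsigned) (c >> 3) >= count)
    return 0;
  return (flags[c >> 3] >> (c & 7)) & 1;
}

/* Skip the opcode, the count byte, and the inlined flags. */
static const unsigned char *
next_p_after_range (const unsigned char *p)
{
  return p + 2 + p[1];
}
```

Storing only as many flag bytes as the highest member requires keeps the compiled pattern short for ranges of low character codes.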
8714
8715 @node Multilingual Support, Consoles; Devices; Frames; Windows, Text, Top
8716 @chapter Multilingual Support
8717 @cindex Mule character sets and encodings
8718 @cindex character sets and encodings, Mule
8719 @cindex encodings, Mule character sets and
8720
8721 @emph{NOTE}: There is a great deal of overlapping and redundant
8722 information in this chapter. Ben wrote introductions to Mule issues a
8723 number of times, each time not realizing that he had already written
8724 another introduction previously. Hopefully, in time these will all be
8725 integrated.
8726
8727 @emph{NOTE}: The information at the top of the source file
8728 @file{text.c} is more complete than the following, and there is also a
8729 list of all other places to look for text/I18N-related info. Also look in
8730 @file{text.h} for info about the DFC and Eistring API's.
8731
8732 Recall that there are two primary ways that text is represented in
8733 XEmacs. The @dfn{buffer} representation sees the text as a series of
8734 bytes (Ibytes), with a variable number of bytes used per character.
8735 The @dfn{character} representation sees the text as a series of integers
8736 (Ichars), one per character. The character representation is a cleaner
8737 representation from a theoretical standpoint, and is thus used in many
8738 cases when lots of manipulations on a string need to be done. However,
8739 the buffer representation is the standard representation used in both
8740 Lisp strings and buffers, and because of this, it is the ``default''
8741 representation that text comes in. The reason for using this
8742 representation is that it's compact and is compatible with ASCII.
8743
8744 @menu
8745 * Introduction to Multilingual Issues #1::
8746 * Introduction to Multilingual Issues #2::
8747 * Introduction to Multilingual Issues #3::
8748 * Introduction to Multilingual Issues #4::
8749 * Character Sets::
8750 * Encodings::
8751 * Internal Mule Encodings::
8752 * Byte/Character Types; Buffer Positions; Other Typedefs::
8753 * Internal Text API's::
8754 * Coding for Mule::
8755 * CCL::
8756 * Microsoft Windows-Related Multilingual Issues::
8757 * Modules for Internationalization::
8758 @end menu
8759
8760 @node Introduction to Multilingual Issues #1, Introduction to Multilingual Issues #2, Multilingual Support, Multilingual Support
8761 @section Introduction to Multilingual Issues #1
8762 @cindex introduction to multilingual issues #1
8763
8764 There is an introduction to these issues in the Lisp Reference manual.
8765 @xref{Internationalization Terminology,,, lispref, XEmacs Lisp Reference
8766 Manual}. Among other documentation that may be of interest to internals
8767 programmers is ISO-2022 (@pxref{ISO 2022,,, lispref, XEmacs Lisp
8768 Reference Manual}) and CCL (@pxref{CCL,,, lispref, XEmacs Lisp Reference
8769 Manual}).
8770
8771 @node Introduction to Multilingual Issues #2, Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #1, Multilingual Support
8772 @section Introduction to Multilingual Issues #2
8773 @cindex introduction to multilingual issues #2
8774
8775 @subheading Introduction
8776
8777 This document covers a number of design issues, problems and proposals
8778 with regard to XEmacs MULE. First we present some definitions and
8779 some aspects of the design that have been agreed upon. Then we present
8780 some issues and problems that need to be addressed, and then I include a
8781 proposal of mine to address some of these issues. When there are other
8782 proposals, for example from Olivier, these will be appended to the end
8783 of this document.
8784
8785 @subheading Definitions and Design Basics
8786
8787 First, @dfn{text} is defined to be a series of characters which together
8788 defines an utterance or partial utterance in some language.
8789 Generally, this language is a human language, but it may also be a
8790 computer language if the computer language uses a representation close
8791 enough to that of human languages for it to also make sense to call its
8792 representation text. Text is opposed to @dfn{binary}, which is a sequence
8793 of bytes, representing machine-readable but not human-readable data.
8794 A @dfn{byte} is merely a number within a predefined range, which nowadays is
8795 nearly always zero to 255. A @dfn{character} is a unit of text. What makes
8796 one character different from another is not always clear-cut. It is
8797 generally related to the appearance of the character, although perhaps
8798 not any possible appearance of that character, but some sort of ideal
8799 appearance that is assigned to a character. Whether two characters
8800 that look very similar are actually the same depends on various
8801 factors, including political ones and whether the characters are
8802 used to mean similar sorts of things or behave similarly in similar
8803 contexts. In any case, it is not always clearly defined whether two
8804 characters are actually the same or not. In practice, however, this
8805 is more or less agreed upon.
8806
8807 A @dfn{character set} is just that, a set of one or more characters.
8808 The set is unique in that there will not be more than one instance of
8809 the same character in a character set, and logically is unordered,
8810 although an order is often imposed or suggested for the characters in
8811 the character set. We can also define an @dfn{order} on a character
8812 set, which is a way of assigning a unique number, or possibly a pair of
8813 numbers, or a triplet of numbers, or even a set of four or more numbers
8814 to each character in the character set. Combining a character set with
8815 an order results in an @dfn{ordered character set}. In an
8816 ordered character set, there is an upper limit and a lower limit on the
8817 possible values that a character, or that any number within the set of
8818 numbers assigned to a character, can take. However, the lower limit
8819 does not have to start at zero or one, or anywhere else in particular,
8820 nor does the upper limit have to end anywhere particular, and there may
8821 be gaps within these ranges such that particular numbers or sets of
8822 numbers do not have a corresponding character, even though they are
8823 within the upper and lower limits. For example, @dfn{ASCII} defines a
8824 very standard ordered character set. It is normally defined to be 94
8825 characters in the range 33 through 126 inclusive on both ends, with
8826 every possible character within this range being actually present in the
8827 character set.
8828
8829 Sometimes the ASCII character set is extended to include what are called
8830 @dfn{non-printing characters}. Non-printing characters are characters
8831 which instead of really being displayed in a more or less rectangular
8832 block, like all other characters, instead indicate certain functions
8833 typically related to either control of the display upon which the
8834 characters are being displayed, or have some effect on a communications
8835 channel that may be currently open and transmitting characters, or may
8836 change the meaning of future characters as they are being decoded, or
8837 some other similar function. You might say that non-printing characters
8838 are somewhat of a hack because they are a special exception to the
8839 standard concept of a character as being a printed glyph that has some
8840 direct correspondence in the non-computer world.
8841
8842 With non-printing characters in mind, the 94-character ordered character
8843 set called ASCII is often extended into a 96-character ordered character
8844 set, also often called ASCII, which includes in addition to the 94
8845 characters already mentioned, two non-printing characters, one called
8846 space and assigned the number 32, just below the bottom of the previous
8847 range, and another called @dfn{delete} or @dfn{rubout}, which is given
8848 number 127 just above the end of the previous range. Thus to reiterate,
8849 the result is a 96-character ordered character set, whose characters
8850 take the values from 32 to 127 inclusive. Sometimes ASCII is further
8851 extended to contain 32 more non-printing characters, which are given the
8852 numbers zero through 31 so that the result is a 128-character ordered
8853 character set with characters numbered zero through 127, and with many
8854 non-printing characters. Another way to look at this, and the way that
8855 is normally taken by XEmacs MULE, is that the characters that would be
8856 in the range zero through 31 in the most extended definition of ASCII,
8857 instead form their own ordered character set, which is called
8858 @dfn{control zero}, and consists of 32 characters in the range zero
8859 through 31. A similar ordered character set called @dfn{control one} is
8860 also created, and it contains 32 more non-printing characters in the
8861 range 128 through 159. Note that none of these three ordered character
8862 sets overlaps with another in the numbers that are assigned to their
8863 characters, so they can all be used at once. Note further that the same
8864 character can occur in more than one character set. This was shown
8865 above, for example, in two different ordered character sets we defined,
8866 one of which we could have called @dfn{ASCII}, and the other
8867 @dfn{ASCII-extended}, to show that it had been extended by two non-printable
8868 characters. Most of the characters in these two character sets are
8869 shared and present in both of them.
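
As an illustration of the three non-overlapping ordered character sets just described, the following minimal sketch classifies a number by the set that owns it. The names @code{charset_kind}, @code{classify} and @code{position_in_set} are hypothetical and do not come from the XEmacs sources; only the numeric ranges are taken from the text above.

```c
#include <assert.h>

/* Hypothetical names for the three ordered character sets described above. */
enum charset_kind {
    CS_CONTROL_0,   /* 32 non-printing characters, numbers 0..31    */
    CS_ASCII_96,    /* 96-character ASCII, numbers 32..127          */
    CS_CONTROL_1,   /* 32 non-printing characters, numbers 128..159 */
    CS_NONE         /* number not covered by any of the three sets  */
};

/* Return which of the three sets owns the given number.  Because the
   ranges do not overlap, at most one set can match. */
static enum charset_kind classify(unsigned int n)
{
    if (n <= 31)
        return CS_CONTROL_0;
    if (n <= 127)
        return CS_ASCII_96;
    if (n <= 159)
        return CS_CONTROL_1;
    return CS_NONE;
}

/* Position of a number within its own set, counting from zero.
   The caller must pass a number that belongs to one of the sets. */
static unsigned int position_in_set(unsigned int n)
{
    static const unsigned int base[] = { 0, 32, 128 };
    enum charset_kind k = classify(n);

    assert(k != CS_NONE);
    return n - base[k];
}
```

With these definitions, @code{classify(65)} yields @code{CS_ASCII_96} and @code{position_in_set(65)} yields 33, since the 96-character set starts at 32.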
8870
8871 Note that there is no restriction on the size of the character set, or
8872 on the numbers that are assigned to characters in an ordered character
8873 set. It is often extremely useful to represent a sequence of characters
8874 as a sequence of bytes, where a byte as defined above is a number in the
8875 range zero to 255. An @dfn{encoding} does precisely this. It is simply
8876 a mapping from a sequence of characters, possibly augmented with
8877 information indicating the character set that each of these characters
8878 belongs to, to a sequence of bytes which represents that sequence of
8879 characters and no other -- which is to say the mapping is reversible.
8880
8881 A @dfn{coding system} is a set of rules for encoding a sequence of
8882 characters augmented with character set information into a sequence of
8883 bytes, and later performing the reverse operation. It is frequently
8884 possible to group coding systems into classes or types based on common
8885 features. Typically, for example, a particular coding system class
8886 may contain a base coding system which specifies some of the rules,
8887 but leaves the rest unspecified. Individual members of the coding
8888 system class are formed by starting with the base coding system, and
8889 augmenting it with additional rules to produce a particular coding
8890 system, what you might think of as a sort of variation within a
8891 theme.
8892
8893 @subheading XEmacs Specific Definitions
8894
8895 First of all, in XEmacs, the concept of character is a little different
8896 from the general definition given above. For one thing, the character
8897 set that a character belongs to may or may not be an inherent part of
8898 the character itself. In other words, the same character occurring in
8899 two different character sets may appear in XEmacs as two different
8900 characters. This is generally the case now, but we are attempting to
8901 move in the other direction. Different proposals may have different
8902 ideas about exactly the extent to which this change will be carried out.
8903 The general trend, though, is to represent all information about a
8904 character other than the character itself, using text properties
8905 attached to the character. That way two instances of the same character
8906 will look the same to lisp code that merely retrieves the character, and
8907 does not also look at the text properties of that character. Everyone
8908 involved is in agreement in doing it this way with all Latin characters,
8909 and in fact for all characters other than Chinese, Japanese, and Korean
8910 ideographs. For those, there may be a difference of opinion.
8911
8912 A second difference between the general definition of character and the
8913 XEmacs usage of character is that each character is assigned a unique
8914 number that distinguishes it from all other characters in the world, or
8915 at the very least, from all other characters currently existing anywhere
8916 inside the current XEmacs invocation. (If there is a case where the
8917 weaker statement applies, but not the stronger statement, it would
8918 possibly be with composite characters and any other such characters that
8919 are created on the sly.)
8920
8921 This unique number is called the @dfn{character representation} of the
8922 character, and its particular details are a matter of debate. There is
8923 a current standard in use, but it is undoubtedly going to change.
8924 What has definitely been agreed upon is that it will be an integer, more
8925 specifically a positive integer, represented with less than or equal to
8926 31 bits on a 32-bit architecture, and possibly up to 63 bits on a 64-bit
8927 architecture, with the proviso that any characters whose
8928 representation would fit in a 64-bit architecture, but not on a 32-bit
8929 architecture, would be used only for composite characters, and others
8930 that would satisfy the weak uniqueness property mentioned above, but not
8931 with the strong uniqueness property.
8932
8933 At this point, it is useful to talk about the different representations
8934 that a sequence of characters can take. The simplest representation is
8935 simply as a sequence of characters, and this is called the @dfn{Lisp
8936 representation} of text, because it is the representation that Lisp
8937 programs see. Other representations include the external
8938 representation, which refers to any encoding of the sequence of
8939 characters, using the definition of encoding mentioned above.
8940 Typically, text in the external representation is used outside of
8941 XEmacs, for example in files, e-mail messages, web sites, and the like.
8942 Another representation for a sequence of characters is what I will call
8943 the @dfn{byte representation}, and it represents the way that XEmacs
8944 internally represents text in a buffer, or in a string. Potentially,
8945 the representation could be different between a buffer and a string, and
8946 then the terms @dfn{buffer byte representation} and @dfn{string byte
8947 representation} would be used, but in practice I don't think this will
8948 occur. It will be possible, of course, for buffers and strings, or
8949 particular buffers and particular strings, to contain different
8950 sub-representations of a single representation. For example, Olivier's
8951 1-2-4 proposal allows for three sub-representations of his internal byte
8952 representation, allowing for 1-byte, 2-byte, and 4-byte wide
8953 characters respectively. A particular string may be in one
8954 sub-representation, and a particular buffer in another
8955 sub-representation, but overall both are following the same byte
8956 representation. I do not use the term @dfn{internal representation}
8957 here, as many people have, because it is potentially ambiguous.
8958
8959 Another representation is called the @dfn{array of characters
8960 representation}. This is a representation on the C-level in which the
8961 sequence of text is represented, not using the byte representation, but
8962 by using an array of characters, each represented using the character
8963 representation. This sort of representation is often used by redisplay
8964 because it is more convenient to work with than any of the other
8965 internal representations.
8966
8967 The term @dfn{binary representation} may also be heard. Binary
8968 representation is used to represent binary data. When binary data is
8969 represented in the lisp representation, an equivalence is simply set up
8970 between bytes zero through 255, and characters zero through 255. These
8971 characters come from four character sets, which are from bottom to top,
8972 control zero, ASCII, control 1, and Latin 1. Together, they comprise
8973 256 characters, and are a good mapping for the 256 possible bytes in a
8974 binary representation. Binary representation could also be used to
8975 refer to an external representation of the binary data, which is a
8976 simple direct byte-to-byte representation. No internal representation
8977 should ever be referred to as a binary representation because of
8978 ambiguity. The terms character set/encoding system were defined
8979 generally, above. In XEmacs, the equivalent concepts exist, although
8980 character set has been shortened to charset, and in fact represents
8981 specifically an ordered character set. For each possible charset, and
8982 for each possible coding system, there is an associated object in
8983 XEmacs. These objects will be of type charset and coding system,
8984 respectively. Charsets and coding systems are divided into classes, or
8985 @dfn{types}, the normal term under XEmacs, and all possible charsets
8986 and coding systems that may be defined must be in one of these types. If
8987 you need to create a charset or coding system that is not one of these
8988 types, you will have to modify the C code to support this new type.
8989 Some of the existing or soon-to-be-created types are, or will be,
8990 generic enough so that this shouldn't be an issue. Note also that the
8991 byte encoding for text and the character coding of a character are
8992 closely related. You might say that ideally each is the simplest
8993 equivalent of the other given the general constraints on each
8994 representation.
8995
8996 To be specific, in the current MULE representation,
7256 8997
7257 @enumerate 8998 @enumerate
7258 @item 8999 @item
7259 the @code{staticpro}'ed variables (there is a special 9000 Characters encode both the character itself and the character set
7260 @code{staticpro_nodump()} call for protected variables we do not want to 9001 that it comes from. These character sets are always assumed to be
7261 dump). 9002 representable as an ordered character set of size 96 or of size 96
7262 9003 by 96, or the trivially-related sizes 94 and 94 by 94. The only
7263 @item 9004 allowable exceptions are the control zero and control one character
7264 the Lisp_Object variables registered via @code{dump_add_root_lisp_object} 9005 sets, which are of size 32. Character sets which do not naturally
7265 (@code{staticpro()} is equivalent to @code{staticpro_nodump()} + 9006 have a compatible ordering such as this are shoehorned into an
7266 @code{dump_add_root_lisp_object()}). 9007 ordered character set, or possibly two ordered character sets of a
7267 9008 compatible size.
7268 @item 9009 @item
7269 the data-segment memory blocks registered via @code{dump_add_root_block} 9010 The variable width byte representation was deliberately chosen to
7270 (for blocks with relocatable pointers), or @code{dump_add_opaque} (for 9011 allow scanning text forwards and backwards efficiently. This
7271 "opaque" blocks with no relocatable pointers; this is just a shortcut 9012 necessitated defining the possible bytes into three ranges which
7272 for calling @code{dump_add_root_block} with a NULL description). 9013 we shall call A, B, and C. Range A is used exclusively for
7273 9014 single-byte characters, which is to say characters that are
7274 @item 9015 represented using only one contiguous byte. Multi-byte
7275 the pointer variables registered via @code{dump_add_root_block_ptr}, 9016 characters are always represented by using one byte from Range B,
7276 each of which points to a block of heap memory (generally a C structure 9017 followed by one or more bytes from Range C. What this means is
7277 or array). Note that @code{dump_add_root_block_ptr} is not technically 9018 that bytes that begin a character are unequivocally distinguished
7278 necessary, as a pointer variable can be seen as a special case of a 9019 from bytes that do not begin a character, and therefore there is
7279 data-segment memory block and registered using 9020 never a problem scanning backwards and finding the beginning of a
7280 @code{dump_add_root_block}. Doing it this way, however, would require 9021 character. Note that UTF-8 adopts an approach that is very similar
7281 another level of static structures declared. Since pointer variables 9022 in spirit in that it uses separate ranges for the first byte of a
7282 are quite common, @code{dump_add_root_block_ptr} is provided for 9023 multi-byte sequence and for the following bytes in a multi-byte
7283 convenience. Note also that internally we have to treat it separately 9024 sequence.
7284 from @code{dump_add_root_block} rather than writing the former as a call 9025 @item
7285 to the latter, since we don't have support for creating and using memory 9026 Given the fact that all ordered character sets allowed were
7286 descriptions on the fly -- they must all be statically declared in the 9027 essentially 96 characters per dimension, it made perfect sense to
7287 data-segment. 9028 make Range C comprise 96 bytes. With a little more tweaking, the
9029 currently-standard MULE byte representation was created, and was
9030 drafted from this.
9031 @item
9032 The MULE byte representation defined four basic representations for
9033 characters, which would take up from one to four bytes,
9034 respectively. The MULE character representation thus had the
9035 following constraints:
9036 @enumerate
9037 @item
9038 Character numbers zero through 255 should represent the
9039 characters that binary values zero through 255 would be
9040 mapped onto. (Note: this was not the case in Kenichi Handa's
9041 version of this representation, but I changed it.)
9042 @item
9043 The four sub-classes of representation in the MULE byte
9044 representation should correspond to four contiguous
9045 non-overlapping ranges of characters.
9046 @item
9047 The algorithmic conversion between the single character
9048 represented in the byte representation and in the character
9049 representation should be as easy as possible.
9050 @item
9051 Given the previous constraints, the character representation
9052 should be as compact as possible, which is to say it should
9053 use the least number of bits possible.
7288 @end enumerate 9054 @end enumerate
7289 9055 @end enumerate
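
The backward-scanning property of a variable-width byte representation can be sketched as follows. Since the exact byte values of Ranges A, B and C are not given here, this sketch uses UTF-8 as a stand-in, which the text notes is similar in spirit: in UTF-8, bytes 0x80 through 0xBF play the role of Range C and never begin a character. The function names are hypothetical and are not part of the XEmacs text API.

```c
#include <assert.h>
#include <stddef.h>

/* In UTF-8, bytes of the form 10xxxxxx (0x80..0xBF) only ever appear as
   the second or later byte of a multi-byte character. */
static int is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Given an index POS anywhere inside BUF, step backwards to the index of
   the first byte of the character that contains POS.  No state is needed:
   each byte says by itself whether it begins a character. */
static size_t char_start(const unsigned char *buf, size_t pos)
{
    assert(buf != NULL);
    while (pos > 0 && is_continuation_byte(buf[pos]))
        pos--;
    return pos;
}
```

For the four-byte buffer @code{'a', 0xD7, 0x90, 'b'}, calling @code{char_start} with index 2 returns 1, the lead byte of the two-byte character in the middle.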
7290 This does not include the GCPRO'ed variables, the specbinds, the 9056
7291 catchtags, the backlist, the redisplay or the profiling info, since we 9057 So you see that the entire structure of the byte and character
7292 do not want to rebuild the actual chain of lisp calls which end up to 9058 representations stemmed from a very small number of basic choices,
7293 the dump-emacs call, only the global variables. 9059 which were
7294
7295 Weak lists and weak hash tables are dumped as if they were their
7296 non-weak equivalent (without changing their type, of course). This has
7297 not yet been a problem.
7298
7299 @node Address allocation, The header, Object inventory, Dumping phase
7300 @subsection Address allocation
7301 @cindex dumping address allocation
7302
7303
7304 The next step is to allocate the offsets of each of the objects in the
7305 final dump file. This is done by @code{pdump_allocate_offset()} which
7306 is called indirectly by @code{pdump_scan_by_alignment()}.
7307
7308 The strategy to deal with alignment problems uses these facts:
7309 9060
7310 @enumerate 9061 @enumerate
7311 @item 9062 @item
7312 real world alignment requirements are powers of two. 9063 the choice to encode character set information in a character
7313 9064 @item
7314 @item 9065 the choice to assume that all character sets would have an order
7315 the C compiler is required to adjust the size of a struct so that you 9066 imposed upon them with 96 characters per one or two
7316 can have an array of them next to each other. This means you can have an 9067 dimensions. (This is less arbitrary than it seems--it follows
7317 upper bound of the alignment requirements of a given structure by 9068 ISO-2022.)
7318 looking at which power of two its size is a multiple. 9069 @item
7319 9070 the choice to use a variable width byte representation.
7320 @item
7321 the non-variant part of variable size lrecords has an alignment
7322 requirement of 4.
7323 @end enumerate 9071 @end enumerate
7324 9072
7325 Hence, for each lrecord type, C struct type or opaque data block the 9073 What this means is that you cannot really separate the byte
7326 alignment requirement is computed as a power of two, with a minimum of 9074 representation, the character representation, and the assumptions made
7327 2^2 for lrecords. @code{pdump_scan_by_alignment()} then scans all the 9075 about characters and whether they represent character sets from each
7328 @code{pdump_block_list_elmt}'s, the ones with the highest requirements 9076 other. All of these are closely intertwined, and for purposes of
7329 first. This ensures the best packing. 9077 simplicity, they should be designed together. If you change one
7330 9078 representation without changing another, you are in essence creating a
7331 The maximum alignment requirement we take into account is 2^8. 9079 completely new design with its own attendant problems--since your new
7332 9080 design is likely to be quite complex and not very coherent with
7333 @code{pdump_allocate_offset()} only has to do a linear allocation, 9081 regards to the translation between the character and byte
7334 starting at offset 256 (this leaves room for the header and keeps the 9082 representations, you are likely to run into problems.
7335 alignments happy). 9083
7336 9084 @node Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #4, Introduction to Multilingual Issues #2, Multilingual Support
7337 @node The header, Data dumping, Address allocation, Dumping phase 9085 @section Introduction to Multilingual Issues #3
7338 @subsection The header 9086 @cindex introduction to multilingual issues #3
7339 @cindex dumping, the header 9087
7340 9088 In XEmacs, Mule is a code word for the support for input handling and
7341 The next step creates the file and writes a header with a signature and 9089 display of multi-lingual text. This section provides an overview of how
7342 some random information in it. The @code{reloc_address} field, which 9090 this support impacts the C and Lisp code in XEmacs. It is important for
7343 indicates at which address the file should be loaded if we want to avoid 9091 anyone who works on the C or the Lisp code, especially on the C code, to
7344 post-reload relocation, is set to 0. It then seeks to offset 256 (base 9092 be aware of these issues, even if they don't work directly on code that
7345 offset for the objects). 9093 implements multi-lingual features, because there are various general
7346 9094 procedures that need to be followed in order to write Mule-compliant
7347 @node Data dumping, Pointers dumping, The header, Dumping phase 9095 code. (The specifics of these procedures are documented elsewhere in
7348 @subsection Data dumping 9096 this manual.)
7349 @cindex data dumping 9097
7350 @cindex dumping, data 9098 There are four primary aspects of Mule support:
7351
7352 The data is dumped in the same order as the addresses were allocated by
7353 @code{pdump_dump_data()}, called from @code{pdump_scan_by_alignment()}.
7354 This function copies the data to a temporary buffer, relocates all
7355 pointers in the object to the addresses allocated in step Address
7356 Allocation, and writes it to the file. Using the same order means that,
7357 if we are careful with lrecords whose size is not a multiple of 4, we
7358 are ensured that the object is always written at the offset in the file
7359 allocated in step Address Allocation.
7360
7361 @node Pointers dumping, , Data dumping, Dumping phase
7362 @subsection Pointers dumping
7363 @cindex pointers dumping
7364 @cindex dumping, pointers
7365
7366 A bunch of tables needed to reassign properly the global pointers are
7367 then written. They are:
7368 9099
7369 @enumerate 9100 @enumerate
7370 @item 9101 @item
7371 the pdump_root_block_ptrs dynarr 9102 internal handling and representation of multi-lingual text.
7372 @item 9103 @item
7373 the pdump_opaques dynarr 9104 conversion between the internal representation of text and the various
7374 @item 9105 external representations in which multi-lingual text is encoded, such as
7375 a vector of all the offsets to the objects in the file that include a 9106 Unicode representations (including mostly fixed width encodings such as
7376 description (for faster relocation at reload time) 9107 UCS-2/UTF-16 and UCS-4 and variable width ASCII conformant encodings,
7377 @item 9108 such as UTF-7 and UTF-8); the various ISO2022 representations, which
7378 the pdump_root_objects and pdump_weak_object_chains dynarrs. 9109 typically use escape sequences to switch between different character
9110 sets (such as Compound Text, used under X Windows; JIS, used
9111 specifically for encoding Japanese; and EUC, a non-modal encoding used
9112 for Japanese, Korean, and certain other languages); Microsoft's
9113 multi-byte encodings (such as Shift-JIS); various simple encodings for
9114 particular 8-bit character sets (such as Latin-1 and Latin-2, and
9115 encodings (such as koi8 and Alternativny) for Cyrillic); and others.
9116 This conversion needs to happen both for text in files and text sent to
9117 or retrieved from system API calls. It even needs to happen for
9118 external binary data because the internal representation does not
9119 represent binary data simply as a sequence of bytes as it is represented
9120 externally.
9121 @item
9122 Proper display of multi-lingual characters.
9123 @item
9124 Input of multi-lingual text using the keyboard.
7379 @end enumerate 9125 @end enumerate
7380 9126
7381 For each of the dynarrs we write both the pointer to the variables and 9127 These four aspects are for the most part independent of each other.
7382 the relocated offset of the object they point to. Since these variables 9128
7383 are global, the pointers are still valid when restarting the program and 9129 @subheading Characters, Character Sets, and Encodings
7384 are used to regenerate the global pointers. 9130
7385 9131 A @dfn{character} (which is, BTW, a surprisingly complex concept) is, in
7386 The @code{pdump_weak_object_chains} dynarr is a special case. The 9132 a written representation of text, the most basic written unit that has a
7387 variables it points to are the head of weak linked lists of lisp objects 9133 meaning of its own. It's comparable to a phoneme when analyzing words
7388 of the same type. Not all objects of this list are dumped so the 9134 in spoken speech (for example, the sound of @samp{t} in English, which
7389 relocated pointer we associate with them points to the first dumped 9135 in fact has different pronunciations in different words -- aspirated in
7390 object of the list, or Qnil if none is available. This is also the 9136 @samp{time}, unaspirated in @samp{stop}, unreleased or even pronounced
7391 reason why they are not used as roots for the purpose of object 9137 as a glottal stop in @samp{button}, etc. -- but logically is a single
7392 enumeration. 9138 concept). Like a phoneme, a character is an abstract concept defined by
7393 9139 its @emph{meaning}. The character @samp{lowercase f}, for example, can
7394 Some very important information like the @code{staticpros} and 9140 always be used to represent the first letter in the word @samp{fill},
7395 @code{lrecord_implementations_table} are handled indirectly using 9141 regardless of whether it's drawn upright or italic, whether the
7396 @code{dump_add_opaque} or @code{dump_add_root_block_ptr}. 9142 @samp{fi} combination is drawn as a single ligature, whether there are
7397 9143 serifs on the bottom of the vertical stroke, etc. (These different
7398 This is the end of the dumping part. 9144 appearances of a single character are often called @dfn{graphs} or
7399 9145 @dfn{glyphs}.) Our concern when representing text is on representing the
7400 @node Reloading phase, Remaining issues, Dumping phase, Dumping 9146 abstract characters, and not on their exact appearance.
7401 @section Reloading phase 9147
7402 @cindex reloading phase 9148 A @dfn{character set} (or @dfn{charset}), as we define it, is a set of
7403 @cindex dumping, reloading phase 9149 characters, each with an associated number (or set of numbers -- see
7404 9150 below), called a @dfn{code point}. It's important to understand that a
7405 @subsection File loading 9151 character is not defined by any number attached to it, but by its
7406 @cindex dumping, file loading 9152 meaning. For example, ASCII and EBCDIC are two charsets containing
7407 9153 exactly the same characters (lowercase and uppercase letters, numbers 0
7408 The file is mmap'ed in memory (which ensures a PAGESIZE alignment, at 9154 through 9, particular punctuation marks) but with different
7409 least 4096), or if mmap is unavailable or fails, a 256-bytes aligned 9155 numberings. The `comma' character in ASCII and EBCDIC, for instance, is
7410 malloc is done and the file is loaded. 9156 the same character despite having a different numbering. Conversely,
7411 9157 when comparing ASCII and JIS-Roman, which look the same except that the
7412 Some variables are reinitialized from the values found in the header. 9158 latter has a yen sign substituted for the backslash, we would say that
7413 9159 the backslash and yen sign are @strong{not} the same characters, despite having
7414 The difference between the actual loading address and the reloc_address 9160 the same number (92) and despite the fact that all other characters are
7415 is computed and will be used for all the relocations. 9161 present in both charsets, with the same numbering. ASCII and JIS-Roman,
7416 9162 then, do @emph{not} have exactly the same characters in them (ASCII has
7417 9163 a backslash character but no yen-sign character, and vice-versa for
7418 @subsection Putting back the pdump_opaques 9164 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII
7419 @cindex dumping, putting back the pdump_opaques 9165 and JIS-Roman are closer.
7420 9166
7421 The memory contents are restored in the obvious and trivial way. 9167 It's also important to distinguish between charsets and encodings. For
7422 9168 a simple charset like ASCII, there is only one encoding normally used --
7423 9169 each character is represented by a single byte, with the same value as
7424 @subsection Putting back the pdump_root_block_ptrs 9170 its code point. For more complicated charsets, however, things are not
7425 @cindex dumping, putting back the pdump_root_block_ptrs 9171 so obvious. Unicode version 2, for example, is a large charset with
7426 9172 thousands of characters, each indexed by a 16-bit number, often
7427 The variables pointed to by pdump_root_block_ptrs in the dump phase are 9173 represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One
7428 reset to the right relocated object addresses. 9174 obvious encoding uses two bytes per character (actually two encodings,
7429 9175 depending on which of the two possible byte orderings is chosen). This
7430 9176 encoding is convenient for internal processing of Unicode text; however,
7431 @subsection Object relocation 9177 it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is
7432 @cindex dumping, object relocation 9178 usually used for external text, for example files or e-mail. UTF-8
7433 9179 represents Unicode characters with one to three bytes (often extended to
7434 All the objects are relocated using their description and their offset 9180 six bytes to handle characters with up to 31-bit indices). Unicode
7435 by @code{pdump_reloc_one}. This step is unnecessary if the 9181 characters 00 to 7F (identical with ASCII) are directly represented with
7436 reloc_address is equal to the file loading address. 9182 one byte, and other characters with two or more bytes, each in the range
7437 9183 80 to FF.
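As a rough illustration of the UTF-8 scheme just described, the following C sketch encodes a 16-bit code point in one to three bytes. The name @code{utf8_encode_bmp} is hypothetical and is not an XEmacs function; the sketch also assumes the caller never passes a surrogate value.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one 16-bit Unicode code point CP as UTF-8.
   Writes 1 to 3 bytes into OUT and returns the number written.
   (Hypothetical helper, for illustration only.) */
size_t
utf8_encode_bmp (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFFFF);
  if (cp <= 0x7F)
    {
      /* 00 - 7F: one byte, identical to ASCII. */
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp <= 0x7FF)
    {
      /* Two bytes, both in the range 80 - FF. */
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  /* Three bytes, all in the range 80 - FF. */
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
  return 3;
}
```

With this sketch, the Hebrew letter aleph (0x05D0) comes out as the two bytes 0xD7 0x90, while any ASCII character is passed through unchanged as a single byte.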
7438 9184
7439 @subsection Putting back the pdump_root_objects and pdump_weak_object_chains 9185 In general, a single encoding may be able to represent more than one
7440 @cindex dumping, putting back the pdump_root_objects and pdump_weak_object_chains 9186 charset.
7441 9187
7442 Same as Putting back the pdump_root_block_ptrs. 9188 @subheading Internal Representation of Text
7443 9189
7444 9190 In an ASCII or single-European-character-set world, life is very simple.
7445 @subsection Reorganize the hash tables 9191 There are 256 characters, and each character is represented using the
7446 @cindex dumping, reorganize the hash tables 9192 numbers 0 through 255, which fit into a single byte. With a few
7447 9193 exceptions (such as case-changing operations or syntax classes like
7448 Since some of the hash values in the lisp hash tables are 9194 'whitespace'), "text" is simply an array of indices into a font. You
7449 address-dependent, their layout is now wrong. So we go through each of 9195 can get different languages simply by choosing fonts with different
7450 them and have them resorted by calling @code{pdump_reorganize_hash_table}. 9196 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and
7451 9197 everything will "just work" as long as anyone else receiving your text
7452 @node Remaining issues, , Reloading phase, Dumping 9198 uses a compatible font.
7453 @section Remaining issues 9199
7454 @cindex dumping, remaining issues 9200 In the multi-lingual world, however, it is much more complicated. There
7455 9201 are a great number of different characters which are organized in a
7456 The build process will have to start a post-dump xemacs, ask it the 9202 complex fashion into various character sets. The representation to use
7457 loading address (which will, hopefully, be always the same between 9203 is not obvious because there are issues of size versus speed to
7458 different xemacs invocations) [[unfortunately, not true on Linux with 9204 consider. In fact, there are in general two kinds of representations to
7459 the ExecShield feature]] and relocate the file to the new address. 9205 work with: one that represents a single character using an integer
7460 This way the object relocation phase will not have to be done, which 9206 (possibly a byte), and the other representing a single character as a
7461 means no writes in the objects and that, because of the use of mmap, the 9207 sequence of bytes. The former representation is normally called fixed
7462 dumped data will be shared between all the xemacs running on the 9208 width, and the other variable width. Both representations represent
7463 computer. 9209 exactly the same characters, and the conversion from one representation
7464 9210 to the other is governed by a specific formula (rather than by table
7465 Some executable signature will be necessary to ensure that a given dump 9211 lookup) but it may not be simple. Most C code need not, and in fact
7466 file is really associated with a given executable, or random crashes 9212 should not, know the specifics of exactly how the representations work.
7467 will occur. Maybe a random number set at compile or configure time thru 9213 In fact, the code must not make assumptions about the representations.
7468 a define. This will also allow for having differently-compiled xemacsen 9214 This means in particular that it must use the proper macros for
7469 on the same system (mule and no-mule comes to mind). 9215 retrieving the character at a particular memory location, determining
7470 9216 how many characters are present in a particular stretch of text, and
7471 The DOC file contents should probably end up in the dump file. 9217 incrementing a pointer to a particular character to point to the
7472 9218 following character, and so on. It must not assume that one character
7473 9219 is stored using one byte, or even using any particular number of bytes.
7474 @node Events and the Event Loop, Asynchronous Events; Quit Checking, Dumping, Top 9220 It must not assume that the number of characters in a stretch of text
9221 bears any particular relation to a number of bytes in that stretch. It
9222 must not assume that the character at a particular memory location can
9223 be retrieved simply by dereferencing the memory location, even if a
9224 character is known to be ASCII or is being compared with an ASCII
9225 character, etc. Careful coding is required to be Mule clean. The
9226 biggest work of adding Mule support, in fact, is converting all of the
9227 existing code to be Mule clean.
9228
9229 Lisp code is mostly unaffected by these concerns. Text in strings and
9230 buffers appears simply as a sequence of characters regardless of
9231 whether Mule support is present. The biggest difference with older
9232 versions of Emacs, as well as current versions of GNU Emacs, is that
9233 integers and characters are no longer equivalent, but are separate
9234 Lisp Object types.
9235
9236 @subheading Conversion Between Internal and External Representations
9237
9238 All text needs to be converted to an external representation before being
9239 sent to a function or file, and all text retrieved from a function or
9240 file needs to be converted to the internal representation. This
9241 conversion needs to happen as close to the source or destination of the
9242 text as possible. No operations should ever be performed on text encoded
9243 in an external representation other than simple copying, because no
9244 assumptions can reliably be made about the format of this text. You
9245 cannot assume, for example, that the end of text is terminated by a null
9246 byte. (For example, if the text is Unicode, it will have many null bytes
9247 in it.) You cannot find the next "slash" character by searching through
9248 the bytes until you find a byte that looks like a "slash" character,
9249 because it might actually be the second byte of a Kanji character.
9250 Furthermore, all text in the internal representation must be converted,
9251 even if it is known to be completely ASCII, because the external
9252 representation may not be ASCII compatible (for example, if it is
9253 Unicode).
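The two pitfalls just mentioned can be made concrete with a small C sketch. The byte arrays and function names below are illustrative assumptions, not XEmacs code. The first array is the letter "A" in little-endian UTF-16; the second is the katakana character SO in Shift-JIS, whose second byte happens to be 0x5C, the ASCII code for backslash (Shift-JIS is chosen here only as a convenient example of an external encoding with this property).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "A" plus a 16-bit terminator, as UTF-16 little-endian bytes. */
static const char utf16le_a[] = { 0x41, 0x00, 0x00, 0x00 };

/* Katakana SO in Shift-JIS: lead byte 0x83, trail byte 0x5C,
   followed by a terminating zero byte. */
static const char sjis_so[] = { (char) 0x83, 0x5C, 0x00 };

/* Naive length computation: stops at the first zero byte, which in
   UTF-16 text is usually in the middle of a character. */
size_t
naive_length (const char *ext)
{
  assert (ext != NULL);
  return strlen (ext);
}

/* Naive byte search for a backslash; returns its offset or -1.
   In Shift-JIS text this can match the trail byte of a two-byte
   character rather than a real backslash. */
long
naive_find_backslash (const char *ext)
{
  const char *p;
  assert (ext != NULL);
  p = strchr (ext, '\\');
  return p ? (long) (p - ext) : -1;
}
```

Here @code{naive_length (utf16le_a)} reports 1 even though the character occupies two bytes, and @code{naive_find_backslash (sjis_so)} reports a match at offset 1 even though the text contains no backslash character at all.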
9254
9255 The place where C code needs to be the most careful is when calling
9256 external API functions. It is easy to forget that all text passed to or
9257 retrieved from these functions needs to be converted. This includes text
9258 in structures passed to or retrieved from these functions and all text
9259 that is passed to a callback function that is called by the system.
9260
9261 Macros are provided to perform conversions to or from external text.
9262 These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT
9263 respectively. These macros accept input in various forms, for example,
9264 Lisp strings, buffers, lstreams, raw data, and can return data in
9265 multiple formats, including both @code{malloc()}ed and @code{alloca()}ed data. The use
9266 of @code{alloca()}ed data here is particularly important because, in general,
9267 the returned data will not be used after making the API call, and as a
9268 result, using @code{alloca()}ed data provides a very cheap and easy to use
9269 method of allocation.
9270
9271 These macros take a coding system argument which indicates the nature of
9272 the external encoding. A coding system is an object that encapsulates
9273 the structures of a particular external encoding and the methods required
9274 to convert to and from this encoding. A facility exists to create coding
9275 system aliases, which in essence gives a single coding system two
9276 different names. It is effectively used in XEmacs to provide a layer of
9277 abstraction on top of the actual coding systems. For example, the coding
9278 system alias "file-name" points to whichever coding system is currently
9279 used for encoding and decoding file names as passed to or retrieved from
9280 system calls. In general, the actual encoding will differ from system to
9281 system, and will also depend on the locale that the user is in. The use
9282 of the file-name alias hides that implementation detail behind an
9283 abstract interface layer, which provides a unified set of coding
9284 systems that is consistent across all operating environments.
9285
9286 The choice of which coding system to use in a particular conversion macro
9287 requires some thought. In general, you should choose a lower-level
9288 actual coding system when the very design of the APIs you are working
9289 with calls for that particular coding system. In all other cases, you
9290 should find the least general abstract coding system (i.e. coding system
9291 alias) that applies to your specific situation. Only use the most
9292 general coding systems, such as native, when there is simply nothing else
9293 that is more appropriate. By doing things this way, you allow the user
9294 more control over how the encoding actually works, because the user is
9295 free to map the abstracted coding system names onto different actual
9296 coding systems.
9297
9298 Some common coding systems are:
9299
9300 @table @code
9301 @item ctext
9302 Compound Text, which is the standard encoding under X Windows, which is
9303 used for clipboard data and possibly other data. (ctext is a coding
9304 system of type ISO2022.)
9305
9306 @item mswindows-unicode
9307 this is used for representing text passed to MS Windows API calls with
9308 arguments that need to be in Unicode format. (mswindows-unicode is a
9309 coding system of type UTF-16.)
9310
9311 @item mswindows-multi-byte
9312 this is used for representing text passed to MS Windows API calls with
9313 arguments that need to be in multi-byte format. Note that there are
9314 very few if any examples of such calls.
9315
9316 @item mswindows-tstr
9317 this is used for representing text passed to any MS Windows API calls
9318 that declare their argument as LPTSTR, or LPCTSTR. This is the vast
9319 majority of system calls and automatically translates either to
9320 mswindows-unicode or mswindows-multi-byte, depending on the presence or
9321 absence of the UNICODE preprocessor constant. (If we compile XEmacs
9322 with this preprocessor constant, then all API calls use Unicode for all
9323 text passed to or received from these API calls.)
9324
9325 @item terminal
9326 used for text sent to or read from a text terminal in the absence of a
9327 more specific coding system (calls to window-system specific APIs should
9328 use the appropriate window-specific coding system if it makes sense to
9329 do so.)
9330
9331 @item file-name
9332 used when specifying the names of files in the absence of a more
9333 specific encoding, such as mswindows-tstr.
9334
9335 @item native
9336 the most general coding system for specifying text passed to system
9337 calls. This generally translates to whatever coding system is specified
9338 by the current locale. This should only be used when none of the coding
9339 systems mentioned above are appropriate.
9340 @end table
9341
9342 @subheading Proper Display of Multilingual Text
9343
9344 There are two things required to get this working correctly. One is
9345 selecting the correct font, and the other is encoding the text according
9346 to the encoding used for that specific font, or the window-system
9347 specific text display API. Generally each separate character set has a
9348 different font associated with it, which is specified by name and each
9349 font has an associated encoding into which the characters must be
9350 translated. (This is the case on X Windows, at least; on Windows there
9351 is a more general mechanism.) Both the specific font for a charset and
9352 the encoding of that font are system dependent. Currently there is a
9353 way of specifying these two properties under X Windows (using the
9354 registry and ccl properties of a character set) but not for other window
9355 systems. A more general system needs to be implemented to allow these
9356 characteristics to be specified for all window systems.
9357
9358 Another issue is making sure that the necessary fonts for displaying
9359 various character sets are installed on the system. Currently, XEmacs
9360 provides, on its web site, X Windows fonts for a number of different
9361 character sets that can be installed by users. This isn't done yet for
9362 Windows, but it should be.
9363
9364 @subheading Inputting of Multilingual Text
9365
9366 This is a rather complicated issue because there are many paradigms
9367 defined for inputting multi-lingual text, some of which are specific to
9368 particular languages, and any particular language may have many
9369 different paradigms defined for inputting its text. These paradigms are
9370 encoded in input methods and there is a standard API for defining an
9371 input method in XEmacs called LEIM, or Library of Emacs Input Methods.
9372 Some of these input methods are written entirely in Elisp, and thus are
9373 system-independent, while others require the aid either of an external
9374 process, or of C level support that ties into a particular
9375 system-specific input method API, for example, XIM under X Windows, or
9376 the active keyboard layout and IME support under Windows. Currently,
9377 there is no support for any system-specific input methods under
9378 Microsoft Windows, although this will change.
9379
9380 @node Introduction to Multilingual Issues #4, Character Sets, Introduction to Multilingual Issues #3, Multilingual Support
9381 @section Introduction to Multilingual Issues #4
9382 @cindex introduction to multilingual issues #4
9383
9384 The rest of the sections in this chapter consist of yet another
9385 introduction to multilingual issues, duplicating the information in the
9386 previous sections.
9387
9388 @node Character Sets, Encodings, Introduction to Multilingual Issues #4, Multilingual Support
9389 @section Character Sets
9390 @cindex character sets
9391
9392 A @dfn{character set} (or @dfn{charset}) is an ordered set of
9393 characters. A particular character in a charset is indexed using one or
9394 more @dfn{position codes}, which are non-negative integers. The number
9395 of position codes needed to identify a particular character in a charset
9396 is called the @dfn{dimension} of the charset. In XEmacs/Mule, all
9397 charsets have dimension 1 or 2, and the size of all charsets (except for
9398 a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range
9399 of position codes used to index characters from any of these types of
9400 character sets is as follows:
9401
9402 @example
9403 Charset type            Position code 1         Position code 2
9404 ------------------------------------------------------------
9405 94                      33 - 126                N/A
9406 96                      32 - 127                N/A
9407 94x94                   33 - 126                33 - 126
9408 96x96                   32 - 127                32 - 127
9409 @end example
9410
9411 Note that in the above cases position codes do not start at an
9412 expected value such as 0 or 1. The reason for this will become clear
9413 later.
9414
9415 For example, Latin-1 is a 96-character charset, and JISX0208 (the
9416 Japanese national character set) is a 94x94-character charset.
9417
9418 [Note that, although the ranges above define the @emph{valid} position
9419 codes for a charset, some of the slots in a particular charset may in
9420 fact be empty. This is the case for JISX0208, for example, where (e.g.)
9421 all the slots whose first position code is in the range 117 - 126 are
9422 empty.]
9423
9424 There are three charsets that do not follow the above rules. All of
9425 them have one dimension, and have ranges of position codes as follows:
9426
9427 @example
9428 Charset name            Position code 1
9429 ------------------------------------
9430 ASCII                   0 - 127
9431 Control-1               0 - 31
9432 Composite               0 - some large number
9433 @end example
9434
9435 (The upper bound of the position code for composite characters has not
9436 yet been determined, but it will probably be at least 16,383).
9437
9438 ASCII is the union of two subsidiary character sets: Printing-ASCII
9439 (the printing ASCII character set, consisting of position codes 33 -
9440 126, like for a standard 94-character charset) and Control-ASCII (the
9441 non-printing characters that would appear in a binary file with codes 0
9442 - 32 and 127).
9443
9444 Control-1 contains the non-printing characters that would appear in a
9445 binary file with codes 128 - 159.
9446
9447 Composite contains characters that are generated by overstriking one
9448 or more characters from other charsets.
9449
9450 Note that some characters in ASCII, and all characters in Control-1,
9451 are @dfn{control} (non-printing) characters. These have no printed
9452 representation but instead control some other function of the printing
9453 (e.g. TAB, or 9, moves the current character position to the next tab
9454 stop). All other characters in all charsets are @dfn{graphic}
9455 (printing) characters.
9456
9457 When a binary file is read in, the bytes in the file are assigned to
9458 character sets as follows:
9459
9460 @example
9461 Bytes           Character set           Range
9462 --------------------------------------------------
9463 0 - 127         ASCII                   0 - 127
9464 128 - 159       Control-1               0 - 31
9465 160 - 255       Latin-1                 32 - 127
9466 @end example
9467
9468 This is a bit ad-hoc but gets the job done.
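The byte-to-charset assignment in the table above can be sketched in C as follows. The enum, struct and function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

enum binary_charset { CS_ASCII, CS_CONTROL_1, CS_LATIN_1 };

struct charset_pos
{
  enum binary_charset charset;
  int position_code;
};

/* Map a byte value (0 - 255) read from a binary file to a character
   set and a position code within that set. */
struct charset_pos
classify_binary_byte (int b)
{
  struct charset_pos r;

  assert (b >= 0 && b <= 255);
  if (b <= 127)
    {
      r.charset = CS_ASCII;            /* position codes 0 - 127 */
      r.position_code = b;
    }
  else if (b <= 159)
    {
      r.charset = CS_CONTROL_1;        /* position codes 0 - 31 */
      r.position_code = b - 128;
    }
  else
    {
      r.charset = CS_LATIN_1;          /* position codes 32 - 127 */
      r.position_code = b - 128;
    }
  return r;
}
```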
9469
9470 @node Encodings, Internal Mule Encodings, Character Sets, Multilingual Support
9471 @section Encodings
9472 @cindex encodings, Mule
9473 @cindex Mule encodings
9474
9475 An @dfn{encoding} is a way of numerically representing characters from
9476 one or more character sets. If an encoding only encompasses one
9477 character set, then the position codes for the characters in that
9478 character set could be used directly. This is not possible, however, if
9479 more than one character set is to be used in the encoding.
9480
9481 For example, the conversion detailed above between bytes in a binary
9482 file and characters is effectively an encoding that encompasses the
9483 three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit
9484 bytes.
9485
9486 Thus, an encoding can be viewed as a way of encoding characters from a
9487 specified group of character sets using a stream of bytes, each of which
9488 contains a fixed number of bits (but not necessarily 8, as in the common
9489 usage of ``byte'').
9490
9491 Here are descriptions of a couple of common
9492 encodings:
9493
9494 @menu
9495 * Japanese EUC (Extended Unix Code)::
9496 * JIS7::
9497 @end menu
9498
9499 @node Japanese EUC (Extended Unix Code), JIS7, Encodings, Encodings
9500 @subsection Japanese EUC (Extended Unix Code)
9501 @cindex Japanese EUC (Extended Unix Code)
9502 @cindex EUC (Extended Unix Code), Japanese
9503 @cindex Extended Unix Code, Japanese EUC
9504
9505 This encompasses the character sets Printing-ASCII, Katakana-JISX0201
9506 (half-width katakana, the right half of JISX0201), Japanese-JISX0208,
9507 and Japanese-JISX0212.
9508
9509 Note that Printing-ASCII and Katakana-JISX0201 are 94-character
9510 charsets, while Japanese-JISX0208 and Japanese-JISX0212 are
9511 94x94-character charsets.
9512
9513 The encoding is as follows:
9514
9515 @example
9516 Character set           Representation (PC=position-code)
9517 -------------           --------------
9518 Printing-ASCII          PC1
9519 Katakana-JISX0201       0x8E | PC1 + 0x80
9520 Japanese-JISX0208       PC1 + 0x80 | PC2 + 0x80
9521 Japanese-JISX0212       0x8F | PC1 + 0x80 | PC2 + 0x80
9522 @end example
9523
9524 Note that there are other versions of EUC for other Asian languages.
9525 EUC in general is characterized by
9526
9527 @enumerate
9528 @item
9529 row-column encoding,
9530 @item
9531 big-endian (row-first) ordering, and
9532 @item
9533 ASCII compatibility in variable width forms.
9534 @end enumerate
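A minimal C sketch of a Japanese EUC encoder follows. All names are hypothetical. Note one assumption: Japanese-JISX0212 is given the single-shift prefix 0x8F, as in standard EUC-JP, which is what lets a decoder tell it apart from Japanese-JISX0208.

```c
#include <assert.h>
#include <stddef.h>

enum euc_charset
{
  EUC_PRINTING_ASCII,
  EUC_KATAKANA_JISX0201,
  EUC_JAPANESE_JISX0208,
  EUC_JAPANESE_JISX0212
};

/* Encode one character, given its charset and position codes, into
   OUT (at least 3 bytes).  PC2 is ignored for 94-character charsets.
   Returns the number of bytes written. */
size_t
euc_jp_encode (enum euc_charset cs, int pc1, int pc2, unsigned char *out)
{
  assert (pc1 >= 33 && pc1 <= 126);
  switch (cs)
    {
    case EUC_PRINTING_ASCII:
      out[0] = (unsigned char) pc1;
      return 1;
    case EUC_KATAKANA_JISX0201:
      out[0] = 0x8E;
      out[1] = (unsigned char) (pc1 + 0x80);
      return 2;
    case EUC_JAPANESE_JISX0208:
      assert (pc2 >= 33 && pc2 <= 126);
      out[0] = (unsigned char) (pc1 + 0x80);
      out[1] = (unsigned char) (pc2 + 0x80);
      return 2;
    case EUC_JAPANESE_JISX0212:
      assert (pc2 >= 33 && pc2 <= 126);
      out[0] = 0x8F;                   /* assumed prefix, see above */
      out[1] = (unsigned char) (pc1 + 0x80);
      out[2] = (unsigned char) (pc2 + 0x80);
      return 3;
    }
  return 0;
}
```

For example, hiragana A has position codes 0x24, 0x22 in Japanese-JISX0208 and is encoded as the bytes 0xA4 0xA2. Because every non-ASCII byte has its high bit set, ASCII text passes through unchanged.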
9535
9536 @node JIS7, , Japanese EUC (Extended Unix Code), Encodings
9537 @subsection JIS7
9538 @cindex JIS7
9539
9540 This encompasses the character sets Printing-ASCII,
9541 Latin-JISX0201 (the left half of JISX0201; this character set
9542 is very similar to Printing-ASCII and is a 94-character charset),
9543 Japanese-JISX0208, and Katakana-JISX0201. It uses 7-bit bytes.
9544
9545 Unlike EUC, this is a @dfn{modal} encoding, which means that there are
9546 multiple states that the encoding can be in, which affect how the bytes
9547 are to be interpreted. Special sequences of bytes (called @dfn{escape
9548 sequences}) are used to change states.
9549
9550 The encoding is as follows:
9551
9552 @example
9553 Character set           Representation (PC=position-code)
9554 -------------           --------------
9555 Printing-ASCII          PC1
9556 Latin-JISX0201          PC1
9557 Katakana-JISX0201       PC1
9558 Japanese-JISX0208       PC1 | PC2
9559
9560
9561 Escape sequence   ASCII equivalent   Meaning
9562 ---------------   ----------------   -------
9563 0x1B 0x28 0x4A    ESC ( J            invoke Latin-JISX0201
9564 0x1B 0x28 0x49    ESC ( I            invoke Katakana-JISX0201
9565 0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208
9566 0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII
9567 @end example
9568
9569 Initially, Printing-ASCII is invoked.
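To show what "modal" means in practice, here is a small C sketch of a JIS7 encoder that remembers which character set is currently invoked and emits an escape sequence only when the character set changes. The type and function names are hypothetical and not taken from XEmacs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum jis7_charset
{
  JIS7_PRINTING_ASCII,
  JIS7_LATIN_JISX0201,
  JIS7_KATAKANA_JISX0201,
  JIS7_JAPANESE_JISX0208
};

/* Escape sequences from the table above, indexed by enum jis7_charset. */
static const unsigned char jis7_escape[4][3] = {
  { 0x1B, 0x28, 0x42 },         /* ESC ( B */
  { 0x1B, 0x28, 0x4A },         /* ESC ( J */
  { 0x1B, 0x28, 0x49 },         /* ESC ( I */
  { 0x1B, 0x24, 0x42 }          /* ESC $ B */
};

struct jis7_encoder
{
  enum jis7_charset current;    /* the state of the modal encoding */
};

/* Initially, Printing-ASCII is invoked. */
void
jis7_init (struct jis7_encoder *enc)
{
  enc->current = JIS7_PRINTING_ASCII;
}

/* Append one character to OUT (at least 5 bytes), preceded by an
   escape sequence if the charset differs from the current state.
   PC2 is ignored for one-dimensional charsets.
   Returns the number of bytes written. */
size_t
jis7_put (struct jis7_encoder *enc, enum jis7_charset cs,
          int pc1, int pc2, unsigned char *out)
{
  size_t n = 0;

  assert (pc1 >= 33 && pc1 <= 126);
  if (cs != enc->current)
    {
      memcpy (out, jis7_escape[cs], 3);
      n = 3;
      enc->current = cs;
    }
  out[n++] = (unsigned char) pc1;
  if (cs == JIS7_JAPANESE_JISX0208)
    {
      assert (pc2 >= 33 && pc2 <= 126);
      out[n++] = (unsigned char) pc2;
    }
  return n;
}
```

Note that the same byte pair means different things depending on the state, which is why a decoder cannot start in the middle of a JIS7 stream without scanning back to the last escape sequence.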
9570
9571 @node Internal Mule Encodings, Byte/Character Types; Buffer Positions; Other Typedefs, Encodings, Multilingual Support
9572 @section Internal Mule Encodings
9573 @cindex internal Mule encodings
9574 @cindex Mule encodings, internal
9575 @cindex encodings, internal Mule
9576
9577 In XEmacs/Mule, each character set is assigned a unique number, called a
9578 @dfn{leading byte}. This is used in the encodings of a character.
9579 Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has
9580 a leading byte of 0), although some leading bytes are reserved.
9581
9582 Charsets whose leading byte is in the range 0x80 - 0x9F are called
9583 @dfn{official} and are used for built-in charsets. Other charsets are
9584 called @dfn{private} and have leading bytes in the range 0xA0 - 0xFF;
9585 these are user-defined charsets.
9586
9587 More specifically:
9588
9589 @example
9590 Character set                Leading byte
9591 -------------                ------------
9592 ASCII                        0 (0x7F in arrays indexed by leading byte)
9593 Composite                    0x8D
9594 Dimension-1 Official         0x80 - 0x8C/0x8D
9595                              (0x8E is free)
9596 Control                      0x8F
9597 Dimension-2 Official         0x90 - 0x99
9598                              (0x9A - 0x9D are free)
9599 Dimension-1 Private Marker   0x9E
9600 Dimension-2 Private Marker   0x9F
9601 Dimension-1 Private          0xA0 - 0xEF
9602 Dimension-2 Private          0xF0 - 0xFF
9603 @end example
9604
9605 There are two internal encodings for characters in XEmacs/Mule. One is
9606 called @dfn{string encoding} and is an 8-bit encoding that is used for
9607 representing characters in a buffer or string. It uses 1 to 4 bytes per
9608 character. The other is called @dfn{character encoding} and is a 19-bit
9609 encoding that is used for representing characters individually in a
9610 variable.
9611
9612 (In the following descriptions, we'll ignore composite characters for
9613 the moment. We also give a general (structural) overview first,
9614 followed later by the exact details.)
9615
9616 @menu
9617 * Internal String Encoding::
9618 * Internal Character Encoding::
9619 @end menu
9620
9621 @node Internal String Encoding, Internal Character Encoding, Internal Mule Encodings, Internal Mule Encodings
9622 @subsection Internal String Encoding
9623 @cindex internal string encoding
9624 @cindex string encoding, internal
9625 @cindex encoding, internal string
9626
9627 ASCII characters are encoded using their position code directly. Other
9628 characters are encoded using their leading byte followed by their
9629 position code(s) with the high bit set. Characters in private character
9630 sets have their leading byte prefixed with a @dfn{leading byte prefix},
9631 which is either 0x9E or 0x9F. (No character sets are ever assigned these
9632 leading bytes.) Specifically:
9633
9634 @example
9635 Character set           Encoding (PC=position-code, LB=leading-byte)
9636 -------------           --------
9637 ASCII                   PC1  |
9638 Control-1               LB   | PC1 + 0xA0 |
9639 Dimension-1 official    LB   | PC1 + 0x80 |
9640 Dimension-1 private     0x9E | LB         | PC1 + 0x80 |
9641 Dimension-2 official    LB   | PC1 + 0x80 | PC2 + 0x80 |
9642 Dimension-2 private     0x9F | LB         | PC1 + 0x80 | PC2 + 0x80
9643 @end example
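The table can be turned into a short C sketch. This is not the actual XEmacs implementation; the function name is hypothetical, and the leading-byte ranges used to decide the dimension are the ones listed in the leading-byte table earlier in this section (with the free slots 0x9A - 0x9D treated as dimension-2 official).

```c
#include <assert.h>
#include <stddef.h>

/* Encode one character in the internal string encoding.
   LB is the charset's leading byte (0 for ASCII); PC2 is ignored for
   one-dimensional charsets.  OUT must hold at least 4 bytes.
   Returns the number of bytes written (1 to 4). */
size_t
mule_string_encode (int lb, int pc1, int pc2, unsigned char *out)
{
  size_t n = 0;
  int two_dim;

  if (lb == 0)                          /* ASCII */
    {
      assert (pc1 >= 0 && pc1 <= 127);
      out[0] = (unsigned char) pc1;
      return 1;
    }
  if (lb == 0x8F)                       /* Control-1 */
    {
      assert (pc1 >= 0 && pc1 <= 31);
      out[0] = 0x8F;
      out[1] = (unsigned char) (pc1 + 0xA0);
      return 2;
    }

  /* 0x9E and 0x9F are prefixes, never leading bytes. */
  assert (lb >= 0x80 && lb <= 0xFF && lb != 0x9E && lb != 0x9F);
  two_dim = (lb >= 0x90 && lb <= 0x9D) || lb >= 0xF0;

  if (lb >= 0xA0)                       /* private: emit the prefix */
    out[n++] = two_dim ? 0x9F : 0x9E;
  out[n++] = (unsigned char) lb;

  assert (pc1 >= 32 && pc1 <= 127);
  out[n++] = (unsigned char) (pc1 + 0x80);
  if (two_dim)
    {
      assert (pc2 >= 32 && pc2 <= 127);
      out[n++] = (unsigned char) (pc2 + 0x80);
    }
  return n;
}
```

In every case the first byte written is at most 0x9F and all later bytes are at least 0xA0.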
9644
9645 The basic characteristic of this encoding is that the first byte
9646 of all characters is in the range 0x00 - 0x9F, and the second and
9647 following bytes of all characters are in the range 0xA0 - 0xFF.
9648 This means that it is impossible to get out of sync, or more
9649 specifically:
9650
9651 @enumerate
9652 @item
9653 Given any byte position, the beginning of the character it is
9654 within can be determined in constant time.
9655 @item
9656 Given any byte position at the beginning of a character, the
9657 beginning of the next character can be determined in constant
9658 time.
9659 @item
9660 Given any byte position at the beginning of a character, the
9661 beginning of the previous character can be determined in constant
9662 time.
9663 @item
9664 Textual searches can simply treat encoded strings as if they
9665 were encoded in a one-byte-per-character fashion rather than
9666 the actual multi-byte encoding.
9667 @end enumerate
9668
9669 None of the standard non-modal encodings meet all of these
9670 conditions. For example, EUC satisfies only (2) and (3), while
9671 Shift-JIS and Big5 (not yet described) satisfy only (2). (All
9672 non-modal encodings must satisfy (2), in order to be unambiguous.)
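Properties (1) and (2) follow directly from the byte ranges. The C sketch below is an illustration under that assumption, with hypothetical function names; it is not the set of macros XEmacs actually provides. Each loop runs at most three times because a character occupies at most four bytes, so both operations take constant time.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if B can only be the first byte of a character. */
int
is_first_byte (unsigned char b)
{
  return b <= 0x9F;
}

/* Return the index of the first byte of the character that
   contains TEXT[POS]. */
size_t
char_start (const unsigned char *text, size_t pos)
{
  while (pos > 0 && !is_first_byte (text[pos]))
    pos--;
  return pos;
}

/* POS must be the start of a character in TEXT, which is LEN bytes
   long.  Return the index of the start of the next character, or
   LEN if there is none. */
size_t
next_char_start (const unsigned char *text, size_t len, size_t pos)
{
  assert (pos < len && is_first_byte (text[pos]));
  pos++;
  while (pos < len && !is_first_byte (text[pos]))
    pos++;
  return pos;
}
```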
9673
9674 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings
9675 @subsection Internal Character Encoding
9676 @cindex internal character encoding
9677 @cindex character encoding, internal
9678 @cindex encoding, internal character
9679
9680 One 19-bit word represents a single character. The word is
9681 separated into three fields:
9682
9683 @example
9684 Bit number:   18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
9685               <------------> <------------------> <------------------>
9686 Field:              1                  2                    3
9687 @end example
9688
9689 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits.
9690
9691 @example
9692 Character set           Field 1         Field 2         Field 3
9693 -------------           -------         -------         -------
9694 ASCII                      0               0              PC1
9695    range:                                              (00 - 7F)
9696 Control-1                  0               1              PC1
9697    range:                                              (00 - 1F)
9698 Dimension-1 official       0            LB - 0x7F         PC1
9699    range:                               (01 - 0D)      (20 - 7F)
9700 Dimension-1 private        0            LB - 0x80         PC1
9701    range:                               (20 - 6F)      (20 - 7F)
9702 Dimension-2 official    LB - 0x8F          PC1            PC2
9703    range:               (01 - 0A)       (20 - 7F)      (20 - 7F)
9704 Dimension-2 private     LB - 0xE1          PC1            PC2
9705    range:               (0F - 1E)       (20 - 7F)      (20 - 7F)
9706 Composite                 0x1F              ?              ?
9707 @end example
9708
9709 Note that character codes 0 - 255 are the same as the ``binary
9710 encoding'' described above.
9711
9712 Most of the code in XEmacs knows nothing of the representation of a
9713 character other than that values 0 - 255 represent ASCII, Control 1,
9714 and Latin 1.
9715
9716 @strong{WARNING WARNING WARNING}: The Boyer-Moore code in
9717 @file{search.c}, and the code in @code{search_buffer()} that determines
9718 whether that code can be used, knows that ``field 3'' in a character
9719 always corresponds to the last byte in the textual representation of the
9720 character. (This is important because the Boyer-Moore algorithm works by
9721 looking at the last byte of the search string and &&#### finish this.
9722
9723 @node Byte/Character Types; Buffer Positions; Other Typedefs, Internal Text API's, Internal Mule Encodings, Multilingual Support
9724 @section Byte/Character Types; Buffer Positions; Other Typedefs
9725 @cindex byte/character types; buffer positions; other typedefs
9726 @cindex byte/character types
9727 @cindex character types
9728 @cindex buffer positions
9729 @cindex typedefs, other
9730
9731 @menu
9732 * Byte Types::
9733 * Different Ways of Seeing Internal Text::
9734 * Buffer Positions::
9735 * Other Typedefs::
9736 * Usage of the Various Representations::
9737 * Working With the Various Representations::
9738 @end menu
9739
9740 @node Byte Types, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs
9741 @subsection Byte Types
9742 @cindex byte types
9743
9744 Stuff pointed to by a char * or unsigned char * will nearly always be
9745 one of the following types:
9746
9747 @itemize @minus
9748 @item
9749 a) [Ibyte] pointer to internally-formatted text
9750 @item
9751 b) [Extbyte] pointer to text in some external format, which can be
9752 defined as all formats other than the internal one
9753 @item
9754 c) [Ascbyte] pure ASCII text
9755 @item
9756 d) [Binbyte] binary data that is not meant to be interpreted as text
9757 @item
9758 e) [Rawbyte] general data in memory, where we don't care about whether
9759 it's text or binary
9760 @item
9761 f) [Boolbyte] a zero or a one
9762 @item
9763 g) [Bitbyte] a byte used for bit fields
9764 @item
9765 h) [Chbyte] null-semantics @code{char *}; used when casting an argument to
9766 an external API where the other types may not be
9767 appropriate
9768 @end itemize
9769
9770 Types (b), (c), (f) and (h) are defined as @code{char}, while the others are
9771 @code{unsigned char}. This is for maximum safety (signed characters are
9772 dangerous to work with) while maintaining as much compatibility with
9773 external API's and string constants as possible.
9774
9775 We also provide versions of the above types defined with different
9776 underlying C types, for API compatibility. These use the following
9777 prefixes:
9778
9779 @example
9780 C = plain char, when the base type is unsigned
9781 U = unsigned
9782 S = signed
9783 @end example
9784
9785 (Formerly I had a comment saying that type (e) "should be replaced with
9786 void *". However, there are in fact many places where an unsigned char
9787 * might be used -- e.g. for ease in pointer computation, since void *
9788 doesn't allow this, and for compatibility with external API's.)
9789
9790 Note that these typedefs are purely for documentation purposes; from
9791 the C code's perspective, they are exactly equivalent to @code{char *},
9792 @code{unsigned char *}, etc., so you can freely use them with library
9793 functions declared as such.
9794
9795 Using these more specific types rather than the general ones helps avoid
9796 the confusions that occur when the semantics of a char * or unsigned
9797 char * argument being studied are unclear. Furthermore, by requiring
9798 that ALL uses of @code{char} be replaced with some other type as part of the
9799 Mule-ization process, we can use a search for @code{char} as a way of finding
9800 code that has not been properly Mule-ized yet.
9801
9802 @node Different Ways of Seeing Internal Text, Buffer Positions, Byte Types, Byte/Character Types; Buffer Positions; Other Typedefs
9803 @subsection Different Ways of Seeing Internal Text
9804 @cindex different ways of seeing internal text
9805
9806 There are various ways of representing internal text. The two primary
9807 ways are as an "array" of individual characters and as a
9808 "stream" of bytes. In the ASCII world, where there are only 256
9809 characters at most, things are easy because each character fits into a
9810 byte. In general, however, this is not true -- see the above discussion
9811 of characters vs. encodings.
9812
9813 In some cases, it's also important to distinguish between a stream
9814 representation as a series of bytes and as a series of textual units.
9815 This is particularly important wrt Unicode. The UTF-16 representation
9816 (sometimes referred to, rather sloppily, as simply the "Unicode" format)
9817 represents text as a series of 16-bit units. Mostly, each unit
9818 corresponds to a single character, but not necessarily, as characters
9819 outside of the range 0-65535 (the BMP or "Basic Multilingual Plane" of
9820 Unicode) require two 16-bit units, through the mechanism of
9821 "surrogates". When a series of 16-bit units is serialized into a byte
9822 stream, there are at least two possible representations, little-endian
9823 and big-endian, and which one is used may depend on the native format of
9824 16-bit integers in the CPU of the machine that XEmacs is running
9825 on. (Similarly, UTF-32 is logically a representation with 32-bit textual
9826 units.)
9827
9828 Specifically:
9829
9830 @itemize @minus
9831 @item
9832 UTF-8 has 1-byte (8-bit) units.
9833 @item
9834 UTF-16 has 2-byte (16-bit) units.
9835 @item
9836 UTF-32 has 4-byte (32-bit) units.
9837 @item
9838 XEmacs-internal encoding (the old "Mule" encoding) has 1-byte (8-bit)
9839 units.
9840 @item
9841 UTF-7 technically has 7-bit units that are within the "mail-safe" range
9842 (ASCII 32 - 126 plus a few control characters), but normally is encoded
9843 in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a
9844 normal mode where printable ASCII characters represent themselves and a
9845 shifted mode, introduced with a plus sign, where a base-64 encoding is
9846 used.)
9847 @item
9848 UTF-5 technically has 7-bit units (normally encoded in an 8-bit stream,
9849 like UTF-7), but only uses uppercase A-V and 0-9, and only encodes 4
9850 bits worth of data per character. UTF-5 is meant for encoding Unicode
9851 inside of DNS names.
9852 @end itemize
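To make the distinction between characters, textual units and bytes concrete, here is a minimal sketch of the UTF-16 case: one character becomes one or two 16-bit units, and each unit becomes two bytes in either byte order. The function names are made up for this illustration and are not part of XEmacs; the code assumes its input is a valid Unicode scalar value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-16 textual units.  Returns the number
   of units written (1 or 2).  CP must be at most 0x10FFFF and must
   not itself be a surrogate. */
size_t
utf16_encode (uint32_t cp, uint16_t units[2])
{
  assert (cp <= 0x10FFFF);
  assert (cp < 0xD800 || cp > 0xDFFF);
  if (cp < 0x10000)
    {
      units[0] = (uint16_t) cp;   /* inside the BMP: one unit */
      return 1;
    }
  cp -= 0x10000;                  /* 20 bits remain */
  units[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
  units[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
  return 2;
}

/* Serialize one 16-bit unit into two bytes, big- or little-endian. */
void
utf16_unit_to_bytes (uint16_t unit, int big_endian, unsigned char out[2])
{
  unsigned char hi = (unsigned char) (unit >> 8);
  unsigned char lo = (unsigned char) (unit & 0xFF);
  out[0] = big_endian ? hi : lo;
  out[1] = big_endian ? lo : hi;
}
```

For example, U+1F600 lies outside the BMP and encodes as the two units 0xD83D 0xDE00; the first unit serializes as the bytes D8 3D (big-endian) or 3D D8 (little-endian).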
9853
9854 Thus, we can imagine three levels in the representation of textual data:
9855
9856 @example
9857 series of characters -> series of textual units -> series of bytes
9858 [Ichar] [Itext] [Ibyte]
9859 @end example
9860
9861 XEmacs has three corresponding typedefs:
9862
9863 @itemize @minus
9864 @item
9865 An Ichar is an integer (at least 32-bit), representing a 31-bit
9866 character.
9867 @item
9868 An Itext is an unsigned value, either 8, 16 or 32 bits, depending
9869 on the nature of the internal representation, and corresponding to
9870 a single textual unit.
9871 @item
9872 An Ibyte is an @code{unsigned char}, representing a single byte in a
9873 textual byte stream.
9874 @end itemize
9875
9876 Internal text in stream format can be simultaneously viewed as either
9877 @code{Itext *} or @code{Ibyte *}. The @code{Ibyte *} representation is convenient for
9878 copying data from one place to another, because such routines usually
9879 expect byte counts. However, @code{Itext *} is much better for actually
9880 working with the data.
9881
9882 From a text-unit perspective, units 0 through 127 will always be ASCII
9883 compatible, and data in Lisp strings (and other textual data generated
9884 as a whole, e.g. from external conversion) will be followed by a
9885 null-unit terminator. From an @code{Ibyte *} perspective, however, the
9886 encoding is only ASCII-compatible if it uses 1-byte units.
9887
9888 Similarly to the different text representations, three integral count
9889 types exist -- Charcount, Textcount and Bytecount.
9890
9891 NOTE: Despite the presence of the terminator, internal text itself can
9892 have nulls in it! (Null text units, not just the null bytes present in
9893 any UTF-16 encoding.) The terminator is present because in many cases
9894 internal text is passed to routines that will ultimately pass the text
9895 to library functions that cannot handle embedded nulls, e.g. functions
9896 manipulating filenames, and it is a real hassle to have to pass the
9897 length around constantly. But this can lead to sloppy coding! We need
9898 to be careful about watching for nulls in places that are important,
9899 e.g. manipulating string objects or passing data to/from the clipboard.
9900
9901 @table @code
9902 @item Ibyte
9903 The data in a buffer or string is logically made up of Ibyte objects,
9904 where an Ibyte takes up the same amount of space as a char. (It is
9905 declared differently, though, to catch invalid usages.) Strings stored
9906 using Ibytes are said to be in "internal format". The important
9907 characteristics of internal format are
9908
9909 @itemize @minus
9910 @item
9911 ASCII characters are represented as a single Ibyte, in the range 0 -
9912 0x7f.
9913 @item
9914 All other characters are represented as an Ibyte in the range 0x80 - 0x9f
9915 followed by one or more Ibytes in the range 0xa0 to 0xff.
9916 @end itemize
9917
9918 This leads to a number of desirable properties:
9919
9920 @itemize @minus
9921 @item
9922 Given the position of the beginning of a character, you can find the
9923 beginning of the next or previous character in constant time.
9924 @item
9925 When searching for a substring or an ASCII character within the string,
9926 you need merely use standard searching routines.
9927 @end itemize
9928
9929 @item Itext
9930
9931 #### Document me.
9932
9933 @item Ichar
9934 This typedef represents a single Emacs character, which can be ASCII,
9935 ISO-8859, or some extended character, as would typically be used for
9936 Kanji. Note that the representation of a character as an Ichar is @strong{not}
9937 the same as the representation of that same character in a string; thus,
9938 you cannot do the standard C trick of passing a pointer to a character
9939 to a function that expects a string.
9940
9941 An Ichar takes up 19 bits of representation and (for code compatibility
9942 and such) is compatible with an int. This representation is visible on
9943 the Lisp level. The important characteristics of the Ichar
9944 representation are
9945
9946 @itemize @minus
9947 @item
9948 values 0x00 - 0x7f represent ASCII.
9949 @item
9950 values 0x80 - 0xff represent the right half of ISO-8859-1.
9951 @item
9952 values 0x100 and up represent all other characters.
9953 @end itemize
9954
9955 This means that Ichar values are upwardly compatible with the standard
9956 8-bit representation of ASCII/ISO-8859-1.
9957
9958 @item Extbyte
9959 Strings that go in or out of Emacs are in "external format", typedef'ed
9960 as an array of char or a char *. There is more than one external format
9961 (JIS, EUC, etc.) but they have similar properties. Many (e.g. JIS) are modal
9962 encodings, which is to say that the meaning of particular bytes is not
9963 fixed but depends on what "mode" the string is currently in (e.g. bytes
9964 in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or
9965 as 2-byte Kanji, depending on the current mode). The mode starts out in
9966 ASCII/ISO-8859-1 and is switched using escape sequences -- for example,
9967 in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes
9968 in the range 0 - 0x7f are interpreted as Kanji characters.
9969
9970 External-formatted data is generally desirable for passing data between
9971 programs because it is upwardly compatible with standard
9972 ASCII/ISO-8859-1 strings and may require less space than internal
9973 encodings such as the one described above. In addition, some encodings
9974 (e.g. JIS) keep all characters (except the ESC used to switch modes) in
9975 the printing ASCII range 0x20 - 0x7e, which results in a much higher
9976 probability that the data will avoid being garbled in transmission.
9977 Externally-formatted data is generally not very convenient to work with,
9978 however, and for this reason is usually converted to internal format
9979 before any work is done on the string.
9980
9981 NOTE: filenames need to be in external format so that ISO-8859-1
9982 characters come out correctly.
9983 @end table
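The byte ranges given under @code{Ibyte} above are what make character movement cheap: a byte below 0xa0 always begins a character, and a byte at or above 0xa0 never does. The following is a minimal sketch of that idea only. The function names and the use of @code{ptrdiff_t} for the count types are assumptions made for illustration; the real macros live in @file{text.h}.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char Ibyte;
typedef ptrdiff_t Charcount;   /* count of characters (type assumed) */
typedef ptrdiff_t Bytecount;   /* count of bytes (type assumed) */

/* Nonzero if B begins a character: ASCII (0x00 - 0x7f) or a leading
   byte (0x80 - 0x9f).  Bytes 0xa0 - 0xff only continue a character. */
int
ibyte_starts_char_p (Ibyte b)
{
  return b < 0xa0;
}

/* P points at the start of a character in well-formed text that is
   followed by another character or the null terminator.  Return the
   start of the next character. */
const Ibyte *
next_char_start (const Ibyte *p)
{
  do
    p++;
  while (!ibyte_starts_char_p (*p));
  return p;
}

/* P points at the start of a character that is not the first one.
   Return the start of the previous character. */
const Ibyte *
prev_char_start (const Ibyte *p)
{
  do
    p--;
  while (!ibyte_starts_char_p (*p));
  return p;
}

/* Count the characters in the LEN bytes starting at TEXT.  The length
   is explicit, so embedded nulls are counted like any other ASCII
   character. */
Charcount
count_chars (const Ibyte *text, Bytecount len)
{
  Charcount n = 0;
  Bytecount i;
  for (i = 0; i < len; i++)
    if (ibyte_starts_char_p (text[i]))
      n++;
  return n;
}
```

Each loop runs at most as many times as the longest character has bytes, which is the sense in which the movement is constant-time.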
9984
9985 @node Buffer Positions, Other Typedefs, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs
9986 @subsection Buffer Positions
9987 @cindex buffer positions
9988
9989 There are three possible ways to specify positions in a buffer. All
9990 of these are one-based: the beginning of the buffer is position or
9991 index 1, and 0 is not a valid position.
9992
9993 As a "buffer position" (typedef Charbpos):
9994
9995 This is an index specifying an offset in characters from the
9996 beginning of the buffer. Note that buffer positions are
9997 logically @strong{between} characters, not on a character. The
9998 difference between two buffer positions specifies the number of
9999 characters between those positions. Buffer positions are the
10000 only kind of position externally visible to the user.
10001
10002 As a "byte index" (typedef Bytebpos):
10003
10004 This is an index over the bytes used to represent the characters
10005 in the buffer. If there is no Mule support, this is identical
10006 to a buffer position, because each character is represented
10007 using one byte. However, with Mule support, many characters
10008 require two or more bytes for their representation, and so a
10009 byte index may be greater than the corresponding buffer
10010 position.
10011
10012 As a "memory index" (typedef Membpos):
10013
10014 This is the byte index adjusted for the gap. For positions
10015 before the gap, this is identical to the byte index. For
10016 positions after the gap, this is the byte index plus the gap
10017 size. There are two possible memory indices for the gap
10018 position; the memory index at the beginning of the gap should
10019 always be used, except in code that deals with manipulating the
10020 gap, where both indices may be seen. The address of the
10021 character "at" (i.e. following) a particular position can be
10022 obtained from the formula
10023
10024 buffer_start_address + memory_index(position) - 1
10025
10026 except in the case of characters at the gap position.
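The relationship between byte indices and memory indices can be sketched as below. This is an illustration under stated assumptions, not the code in insdel.c: the structure, its field names and both functions are hypothetical, and the three index types are simply declared as @code{long}.

```c
#include <assert.h>

typedef long Bytebpos;   /* one-based byte index (type assumed) */
typedef long Membpos;    /* one-based memory index (type assumed) */
typedef long Bytecount;

/* Hypothetical stand-in for the text storage of a buffer. */
struct gap_text
{
  unsigned char *start;  /* address of the first byte of storage */
  Bytebpos gap_pos;      /* byte index at which the gap sits */
  Bytecount gap_size;    /* size of the gap in bytes */
};

/* Convert a byte index to a memory index.  At the gap position itself
   the index of the beginning of the gap is returned, as recommended
   in the text. */
Membpos
bytebpos_to_membpos (const struct gap_text *t, Bytebpos pos)
{
  assert (pos >= 1);
  return pos > t->gap_pos ? pos + t->gap_size : pos;
}

/* Address of the byte "at" (i.e. following) POS.  This handles the
   exception noted above: the byte following the gap position lives
   after the gap, so the end-of-gap index is used there. */
unsigned char *
byte_address (const struct gap_text *t, Bytebpos pos)
{
  Membpos m = pos >= t->gap_pos ? pos + t->gap_size : pos;
  return t->start + m - 1;
}
```

With the text "abcd" stored as "ab", a three-byte gap, then "cd", the gap position is 3; byte index 4 maps to memory index 7, and the byte following position 3 is the "c" just past the gap.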
10027
10028 @node Other Typedefs, Usage of the Various Representations, Buffer Positions, Byte/Character Types; Buffer Positions; Other Typedefs
10029 @subsection Other Typedefs
10030 @cindex other typedefs
10031
10032 Charcount:
10033 ----------
10034 This typedef represents a count of characters, such as
10035 a character offset into a string or the number of
10036 characters between two positions in a buffer. The
10037 difference between two Charbpos's is a Charcount, and
10038 character positions in a string are represented using
10039 a Charcount.
10040
10041 Textcount:
10042 ----------
10043 #### Document me.
10044
10045 Bytecount:
10046 ----------
10047 Similar to a Charcount but represents a count of bytes.
10048 The difference between two Bytebpos's is a Bytecount.
10049
10050
10051 @node Usage of the Various Representations, Working With the Various Representations, Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs
10052 @subsection Usage of the Various Representations
10053 @cindex usage of the various representations
10054
10055 Memory indices are used in low-level functions in insdel.c and for
10056 extent endpoints and marker positions. The reason for this is that
10057 this way, the extents and markers don't need to be updated for most
10058 insertions, which merely shrink the gap and don't move any
10059 characters around in memory.
10060
10061 (The beginning-of-gap memory index simplifies insertions w.r.t.
10062 markers, because text usually gets inserted after markers. For
10063 extents, it is merely for consistency, because text can get
10064 inserted either before or after an extent's endpoint depending on
10065 the open/closedness of the endpoint.)
10066
10067 Byte indices are used in other code that needs to be fast,
10068 such as the searching, redisplay, and extent-manipulation code.
10069
10070 Buffer positions are used in all other code. This is because this
10071 representation is easiest to work with (especially since Lisp
10072 code always uses buffer positions), necessitates the fewest
10073 changes to existing code, and is the safest (e.g. if the text gets
10074 shifted underneath a buffer position, it will still point to a
10075 character; if text is shifted under a byte index, it might point
10076 to the middle of a character, which would be bad).
10077
10078 Similarly, Charcounts are used in all code that deals with strings
10079 except for code that needs to be fast, which uses Bytecounts.
10080
10081 Strings are always passed around internally using internal format.
10082 Conversions to and from external format are performed at the time
10083 that the data goes in or out of Emacs.
10084
10085 @node Working With the Various Representations, , Usage of the Various Representations, Byte/Character Types; Buffer Positions; Other Typedefs
10086 @subsection Working With the Various Representations
10087 @cindex working with the various representations
10088
10089 We write things this way because it's very important that
10090 MAX_BYTEBPOS_GAP_SIZE_3 is a multiple of 3. (As it happens,
10091 65535 is a multiple of 3, but this may not always be the
10092 case.) #### unfinished
10093
10094 @node Internal Text API's, Coding for Mule, Byte/Character Types; Buffer Positions; Other Typedefs, Multilingual Support
10095 @section Internal Text API's
10096 @cindex internal text API's
10097 @cindex text API's, internal
10098 @cindex API's, text, internal
10099
10100 @strong{NOTE}: The most current documentation for these API's is in
10101 @file{text.h}. In case of error, assume that file is correct and this
10102 one wrong.
10103
10104 @menu
10105 * Basic internal-format API's::
10106 * The DFC API::
10107 * The Eistring API::
10108 @end menu
10109
10110 @node Basic internal-format API's, The DFC API, Internal Text API's, Internal Text API's
10111 @subsection Basic internal-format API's
10112 @cindex basic internal-format API's
10113 @cindex internal-format API's, basic
10114 @cindex API's, basic internal-format
10115
10116 These are simple functions and macros to convert between text
10117 representation and characters, move forward and back in text, etc.
10118
10119 #### Finish the rest of this.
10120
10121 Use the following functions/macros on contiguous text in any of the
10122 internal formats. Those that take a format arg work on all internal
10123 formats; the others work only on the default (variable-width under Mule)
10124 format. If the text you're operating on is known to come from a buffer,
10125 use the buffer-level functions in buffer.h, which automatically know the
10126 correct format and handle the gap.
10127
10128 Some terminology:
10129
10130 "itext" appearing in the macros means "internal-format text" -- type
10131 @code{Ibyte *}. Operations on such pointers themselves, rather than on the
10132 text being pointed to, have "itextptr" instead of "itext" in the macro
10133 name. "ichar" in the macro names means an Ichar -- the representation
10134 of a character as a single integer rather than a series of bytes, as part
10135 of "itext". Many of the macros below are for converting between the
10136 two representations of characters.
10137
10138 Note also that we try to consistently distinguish between an "Ichar" and
10139 a Lisp character. Stuff working with Lisp characters often just says
10140 "char", so we consistently use "Ichar" when that's what we're working
10141 with.
10142
10143 @node The DFC API, The Eistring API, Basic internal-format API's, Internal Text API's
10144 @subsection The DFC API
10145 @cindex DFC API
10146 @cindex API, DFC
10147
10148 This is for conversion between internal and external text. Note that
10149 there is also the "new DFC" API, which @strong{returns} a pointer to the
10150 converted text (in alloca space), rather than storing it into a
10151 variable.
10152
10153 The macros below are used for converting data between different formats.
10154 Generally, the data is textual, and the formats are related to
10155 internationalization (e.g. converting between internal-format text and
10156 UTF-8) -- but the mechanism is general, and could be used for anything,
10157 e.g. decoding gzipped data.
10158
10159 In general, conversion involves a source of data, a sink, the existing
10160 format of the source data, and the desired format of the sink. The
10161 macros below, however, always require that either the source or sink is
10162 internal-format text. Therefore, in practice the conversions below
10163 involve source, sink, an external format (specified by a coding system),
10164 and the direction of conversion (internal->external or vice-versa).
10165
10166 Sources and sinks can be raw data (sized or unsized -- when unsized,
10167 input data is assumed to be null-terminated [double null-terminated for
10168 Unicode-format data], and on output the length is not stored anywhere),
10169 Lisp strings, Lisp buffers, lstreams, and opaque data objects. When the
10170 output is raw data, the result can be allocated either with @code{alloca()} or
10171 @code{malloc()}. (There is currently no provision for writing into a fixed
10172 buffer. If you want this, use @code{alloca()} output and then copy the data --
10173 but be careful with the size! Unless you are very sure of the encoding
10174 being used, upper bounds for the size are not in general computable.)
10175 The obvious restrictions on source and sink types apply (e.g. Lisp
10176 strings are a source and sink only for internal data).
10177
10178 All raw data outputted will contain an extra null byte (two bytes for
10179 Unicode -- currently, in fact, all output data, whether internal or
10180 external, is double-null-terminated, but you can't count on this; see
10181 below). This means that enough space is allocated to contain the extra
10182 nulls; however, these nulls are not reflected in the returned output
10183 size.
10184
10185 The most basic macros are TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT.
10186 These can be used to convert between any kinds of sources or sinks.
10187 However, 99% of conversions involve raw data or Lisp strings as both
10188 source and sink, and usually data is output as @code{alloca()} rather than
10189 @code{malloc()}. For this reason, convenience macros are defined for many types
10190 of conversions involving raw data and/or Lisp strings, especially when
10191 the output is an @code{alloca()}ed string. (When the destination is a
10192 Lisp_String, there are other functions that should be used instead --
10193 @code{build_ext_string()} and @code{make_ext_string()}, for example.) The convenience
10194 macros are of two types -- the older kind that store the result into a
10195 specified variable, and the newer kind that return the result. The newer
10196 kind of macros don't exist when the output is sized data, because that
10197 would have two return values. NOTE: All convenience macros are
10198 ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT.
10199 Thus, any comments below about the workings of these macros also apply to
10200 all convenience macros.
10201
10202 @example
10203 TO_EXTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)
10204 TO_INTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)
10205 @end example
10206
10207 Typical use is
10208
10209 @example
10210 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);
10211 @end example
10212
10213 which means that the contents of the lisp string @var{str} are written
10214 to a malloc'ed memory area which will be pointed to by @var{ptr}, after the
10215 function returns. The conversion will be done using the @code{file-name}
10216 coding system (which will be controlled by the user indirectly by
10217 setting or binding the variable @code{file-name-coding-system}).
10218
10219 Some sources and sinks require two C variables to specify. We use
10220 some preprocessor magic to allow different source and sink types, and
10221 even different numbers of arguments to specify different types of
10222 sources and sinks.
10223
10224 So we can have a call that looks like
10225
10226 @example
10227 TO_INTERNAL_FORMAT (DATA, (ptr, len),
10228                     MALLOC, (ptr, len),
10229                     coding_system);
10230 @end example
10231
10232 The parenthesized argument pairs are required to make the
10233 preprocessor magic work.
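The following self-contained sketch shows one way such preprocessor magic can work; it is not the actual DFC implementation. Everything named here (@code{COPY_DATA} and the @code{PAIR_}, @code{SRC_} and @code{SINK_} helpers) is invented for the illustration, and the "conversion" is a plain byte copy. The type keyword is token-pasted onto a helper name to select the per-type behavior, and a parenthesized pair is unpacked by writing a helper macro name directly in front of it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Unpack a parenthesized (ptr, len) pair: PAIR_PTR ((p, n)) expands
   to PAIR_PTR_ (p, n), which in turn expands to (p). */
#define PAIR_PTR_(ptr, len) (ptr)
#define PAIR_LEN_(ptr, len) (len)
#define PAIR_PTR(pair) PAIR_PTR_ pair
#define PAIR_LEN(pair) PAIR_LEN_ pair

/* Per-type accessors for sources. */
#define SRC_PTR_DATA(src) PAIR_PTR (src)
#define SRC_LEN_DATA(src) PAIR_LEN (src)
#define SRC_PTR_C_STRING(src) (src)
#define SRC_LEN_C_STRING(src) strlen (src)

/* Per-type stores for sinks; the sink arguments must be lvalues. */
#define SINK_STORE_MALLOC(sink, p, n) \
  (PAIR_PTR (sink) = (p), PAIR_LEN (sink) = (n))
#define SINK_STORE_C_STRING_MALLOC(sink, p, n) ((sink) = (p))

/* Copy the source into malloc()ed, null-terminated storage and store
   the result in the sink.  The type names select helpers by pasting. */
#define COPY_DATA(src_type, src, sink_type, sink)                 \
  do                                                              \
    {                                                             \
      size_t copy_len_ = SRC_LEN_##src_type (src);                \
      char *copy_buf_ = malloc (copy_len_ + 1);                   \
      assert (copy_buf_ != NULL);                                 \
      memcpy (copy_buf_, SRC_PTR_##src_type (src), copy_len_);    \
      copy_buf_[copy_len_] = '\0';                                \
      SINK_STORE_##sink_type (sink, copy_buf_, copy_len_);        \
    }                                                             \
  while (0)
```

A call such as @code{COPY_DATA (DATA, ("hello", 3), MALLOC, (out, out_len))} then stores a freshly allocated "hel" in @code{out} and 3 in @code{out_len}. Without the parentheses the pair would be seen as two separate macro arguments, and the argument count would no longer match.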
10234
10235 NOTE: GC is inhibited during the entire operation of these macros. This
10236 is because frequently the data to be converted comes from strings but
10237 gets passed in as just DATA, and GC may move around the string data. If
10238 we didn't inhibit GC, there'd have to be a lot of messy recoding,
10239 alloca-copying of strings and other annoying stuff.
10240
10241 The source or sink can be specified in one of these ways:
10242
10243 @example
10244 DATA, (ptr, len), // input data is a fixed buffer of size len
10245 ALLOCA, (ptr, len), // output data is in a @code{ALLOCA()}ed buffer of size len
10246 MALLOC, (ptr, len), // output data is in a @code{malloc()}ed buffer of size len
10247 C_STRING_ALLOCA, ptr, // equivalent to ALLOCA (ptr, len_ignored) on output
10248 C_STRING_MALLOC, ptr, // equivalent to MALLOC (ptr, len_ignored) on output
10249 C_STRING, ptr, // equivalent to DATA, (ptr, strlen/wcslen (ptr))
10250 // on input (the Unicode version is used when correct)
10251 LISP_STRING, string, // input or output is a Lisp_Object of type string
10252 LISP_BUFFER, buffer, // output is written to (point) in lisp buffer
10253 LISP_LSTREAM, lstream, // input or output is a Lisp_Object of type lstream
10254 LISP_OPAQUE, object, // input or output is a Lisp_Object of type opaque
10255 @end example
10256
10257 When specifying the sink, use lvalues, since the macro will assign to them,
10258 except when the sink is an lstream or a lisp buffer.
10259
10260 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the resulting text is
10261 stored in a stack-allocated buffer, which is automatically freed on
10262 returning from the function. However, the sink types @code{MALLOC} and
10263 @code{C_STRING_MALLOC} return @code{xmalloc()}ed memory. The caller is responsible
10264 for freeing this memory using @code{xfree()}.
10265
10266 The macros accept the kinds of sources and sinks appropriate for
10267 internal and external data representation. See the type_checking_assert
10268 macros below for the actual allowed types.
10269
10270 Since some sources and sinks use one argument (a Lisp_Object) to
10271 specify them, while others take a (pointer, length) pair, we use
10272 some C preprocessor trickery to allow pair arguments to be specified
10273 by parenthesizing them, as in the examples above.
10274
10275 Anything prefixed by dfc_ (`data format conversion') is private.
10276 They are only used to implement these macros.
10277
10278 [[Using C_STRING* is appropriate for using with external APIs that
10279 take null-terminated strings. For internal data, we should try to
10280 be '\0'-clean - i.e. allow arbitrary data to contain embedded '\0'.
10281
10282 Sometime in the future we might allow output to C_STRING_ALLOCA or
10283 C_STRING_MALLOC _only_ with @code{TO_EXTERNAL_FORMAT()}, not
10284 @code{TO_INTERNAL_FORMAT()}.]]
10285
10286 The above comments are not true. Frequently (most of the time, in
10287 fact), external strings come as zero-terminated entities, where the
10288 zero-termination is the only way to find out the length. Even in
10289 cases where you can get the length, most of the time the system will
10290 still use the null to signal the end of the string, and there will
10291 still be no way to either send in or receive a string with embedded
10292 nulls. In such situations, it's pointless to track the length
10293 because null bytes can never be in the string. We have a lot of
10294 operations that make it easy to operate on zero-terminated strings,
10295 and forcing the user to deal with the length everywhere would only
10296 make the code uglier and more complicated, for no gain. --ben
10297
10298 There is no problem using the same lvalue for source and sink.
10299
10300 Also, when pointers are required, the code (currently at least) is
10301 lax and allows any pointer types, either in the source or the sink.
10302 This makes it possible, e.g., to deal with internal format data held
10303 in char *'s or external format data held in WCHAR * (i.e. Unicode).
10304
10305 Finally, whenever storage allocation is called for, extra space is
10306 allocated for a terminating zero, and such a zero is stored in the
10307 appropriate place, regardless of whether the source data was
10308 specified using a length or was specified as zero-terminated. This
10309 allows you to freely pass the resulting data, no matter how
10310 obtained, to a routine that expects zero termination (modulo, of
10311 course, that any embedded zeros in the resulting text will cause
10312 truncation). In fact, currently two terminating zeros are allocated
10313 and stored after the data result. This is to allow for the
10314 possibility of storing a Unicode value on output, which needs the
10315 two zeros. Currently, however, the two zeros are stored regardless
10316 of whether the conversion is internal or external and regardless of
10317 whether the external coding system is in fact Unicode. This
10318 behavior may change in the future, and you cannot rely on this --
10319 the most you can rely on is that sink data in Unicode format will
10320 have two terminating nulls, which combine to form one Unicode null
10321 character.
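The termination convention just described can be sketched as follows. This is a stand-alone illustration, not the DFC code: the function name is hypothetical, and the "conversion" is a plain copy. The point is only that room for two zero bytes is always allocated, while the reported size covers the data alone.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy LEN bytes of DATA into malloc()ed storage and append two zero
   bytes, enough to form one 16-bit null unit.  *OUT_LEN receives LEN,
   which does not count the terminators.  Returns NULL if the
   allocation fails; otherwise the caller frees the result. */
unsigned char *
copy_with_terminators (const void *data, size_t len, size_t *out_len)
{
  unsigned char *buf = malloc (len + 2);
  if (buf == NULL)
    return NULL;
  if (len > 0)
    memcpy (buf, data, len);
  buf[len] = 0;
  buf[len + 1] = 0;
  *out_len = len;
  return buf;
}
```

For four bytes of little-endian UTF-16 ("hi"), the result occupies six bytes of storage but the reported length is 4.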
10322
10323 NOTE: You might ask, why are these not written as functions that
10324 @strong{RETURN} the converted string, since that would allow them to be used
10325 much more conveniently, without having to constantly declare temporary
10326 variables? The answer is that in fact I originally did write the
10327 routines that way, but that required either
10328
10329 @itemize @bullet
10330 @item
10331 (a) calling @code{alloca()} inside of a function call, or
10332 @item
10333 (b) using expressions separated by commas and a global temporary variable, or
10334 @item
10335 (c) using the GCC extension (@{ ... @}).
10336 @end itemize
10337
10338 Turned out that all of the above had bugs, all caused by GCC (hence the
10339 comments about "those GCC wankers" and "ream gcc up the ass"). As for
10340 (a), some versions of GCC (especially on Intel platforms) had
10341 buggy implementations of @code{alloca()} that couldn't handle being called
10342 inside of a function call -- they just decremented the stack right in the
10343 middle of pushing args. Oops, crash with stack trashing, very bad. (b)
10344 was an attempt to fix (a), and that led to further GCC crashes, esp. when
10345 you had two such calls in a single subexpression, because GCC couldn't be
10346 counted upon to follow even a minimally reasonable order of execution.
10347 True, you can't count on one argument being evaluated before another, but
10348 GCC would actually interleave them so that the temp var got stomped on by
10349 one while the other was accessing it. So I tried (c), which was
10350 problematic because that GCC extension has more bugs in it than a
10351 termite's nest.
10352
10353 So reluctantly I converted to the current way. Now, that was awhile ago
10354 (c. 1994), and it appears that the bug involving alloca in function calls
10355 has long since been fixed. More recently, I defined the new-dfc routines
10356 down below, which DO allow exactly such convenience of returning your
10357 args rather than store them in temp variables, and I also wrote a
10358 configure check to see whether @code{alloca()} causes crashes inside of function
10359 calls, and if so use the portable @code{alloca()} implementation in alloca.c.
10360 If you define TEST_NEW_DFC, the old routines get written in terms of the
10361 new ones, and I've had a beta put out with this on, and
10362 this appears to cause no problems -- so we should consider
10363 switching, and feel no compunctions about writing further such function-
10364 like @code{alloca()} routines in lieu of statement-like ones. --ben
10365
10366 @node The Eistring API, , The DFC API, Internal Text API's
10367 @subsection The Eistring API
10368 @cindex Eistring API
10369 @cindex API, Eistring
10370
10371 (This API is currently under-used.) When doing simple things with
10372 internal text, the basic internal-format API's are enough. But
10373 things like deleting or replacing a substring, concatenating various strings,
10374 etc. are difficult to do cleanly because of the allocation issues.
10375 The Eistring API is designed to deal with this, and provides a clean
10376 way of modifying and building up internal text. (Note that the former
10377 lack of this API has meant that some code uses Lisp strings to do
10378 similar manipulations, resulting in excess garbage and increased
10379 garbage collection.)
10380
10381 NOTE: The Eistring API is (or should be) Mule-correct even without
10382 an ASCII-compatible internal representation.
10383
10384 @example
10385 #### NOTE: This is a work in progress. Neither the API nor especially
10386 the implementation is finished.
10387
10388 NOTE: An Eistring is a structure that makes it easy to work with
10389 internally-formatted strings of data. It provides operations similar
10390 in feel to the standard @code{strcpy()}, @code{strcat()}, @code{strlen()}, etc., but
10391
10392 (a) it is Mule-correct
10393 (b) it does dynamic allocation so you never have to worry about size
10394 restrictions
10395 (c) it comes in an @code{ALLOCA()} variety (all allocation is stack-local,
10396 so there is no need to explicitly clean up) as well as a @code{malloc()}
10397 variety
10398 (d) it knows its own length, so it does not suffer from standard null
10399 byte brain-damage -- but it null-terminates the data anyway, so
10400 it can be passed to standard routines
10401 (e) it provides a much more powerful set of operations and knows about
10402 all the standard places where string data might reside: Lisp_Objects,
10403 other Eistrings, Ibyte * data with or without an explicit length,
10404 ASCII strings, Ichars, etc.
10405 (f) it provides easy operations to convert to/from externally-formatted
10406 data, and is easier to use than the standard TO_INTERNAL_FORMAT
10407 and TO_EXTERNAL_FORMAT macros. (An Eistring can store both the internal
10408 and external version of its data, but the external version is only
10409 initialized or changed when you call @code{eito_external()}.)
10410
10411 The idea is to make it as easy to write Mule-correct string manipulation
10412 code as it is to write normal string manipulation code. We also make
10413 the API sufficiently general that it can handle multiple internal data
10414 formats (e.g. some fixed-width optimizing formats and a default variable
10415 width format) and allows for @strong{ANY} data format we might choose in the
10416 future for the default format, including UCS2. (In other words, we can't
10417 assume that the internal format is ASCII-compatible and we can't assume
10418 it doesn't have embedded null bytes. We do assume, however, that any
10419 chosen format will have the concept of null-termination.) All of this is
10420 hidden from the user.
10421
10422 #### It is really too bad that we don't have a real object-oriented
10423 language, or at least a language with polymorphism!
10424
10425
10426 **********************************************
10427 * Declaration *
10428 **********************************************
10429
10430 To declare an Eistring, either put one of the following in the local
10431 variable section:
10432
10433 DECLARE_EISTRING (name);
10434 Declare a new Eistring and initialize it to the empty string. This
10435 is a standard local variable declaration and can go anywhere in the
10436 variable declaration section. NAME itself is declared as an
10437 Eistring *, and its storage declared on the stack.
10438
10439 DECLARE_EISTRING_MALLOC (name);
10440 Declare and initialize a new Eistring, which uses @code{malloc()}ed
10441 instead of @code{ALLOCA()}ed data. This is a standard local variable
10442 declaration and can go anywhere in the variable declaration
10443 section. Once you initialize the Eistring, you will have to free
10444 it using @code{eifree()} to avoid memory leaks. You will need to use this
10445 form if you are passing an Eistring to any function that modifies
10446 it (otherwise, the modified data may be in stack space and get
10447 overwritten when the function returns).
10448
10449 or use
10450
10451 Eistring ei;
10452 void eiinit (Eistring *ei);
10453 void eiinit_malloc (Eistring *einame);
10454 If you need to put an Eistring elsewhere than in a local variable
10455 declaration (e.g. in a structure), declare it as shown and then
10456 call one of the init macros.
10457
10458 Also note:
10459
10460 void eifree (Eistring *ei);
10461 If you declared an Eistring to use @code{malloc()} to hold its data,
10462 or converted it to the heap using @code{eito_malloc()}, then this
10463 releases any data in it and afterwards resets the Eistring
10464 using @code{eiinit_malloc()}. Otherwise, it just resets the Eistring
10465 using @code{eiinit()}.
10466
10467
10468 **********************************************
10469 * Conventions *
10470 **********************************************
10471
10472 - The names of the functions have been chosen, where possible, to
10473 match the names of @code{str*()} functions in the standard C API.
10474 -
10475
10476
10477 **********************************************
10478 * Initialization *
10479 **********************************************
10480
10481 void eireset (Eistring *eistr);
10482 Initialize the Eistring to the empty string.
10483
10484 void eicpy_* (Eistring *eistr, ...);
10485 Initialize the Eistring from somewhere:
10486
10487 void eicpy_ei (Eistring *eistr, Eistring *eistr2);
10488 ... from another Eistring.
10489 void eicpy_lstr (Eistring *eistr, Lisp_Object lisp_string);
10490 ... from a Lisp_Object string.
10491 void eicpy_ch (Eistring *eistr, Ichar ch);
10492 ... from an Ichar (this can be a conventional C character).
10493
10494 void eicpy_lstr_off (Eistring *eistr, Lisp_Object lisp_string,
10495 Bytecount off, Charcount charoff,
10496 Bytecount len, Charcount charlen);
10497 ... from a section of a Lisp_Object string.
10498 void eicpy_lbuf (Eistring *eistr, Lisp_Object lisp_buf,
10499 Bytecount off, Charcount charoff,
10500 Bytecount len, Charcount charlen);
10501 ... from a section of a Lisp_Object buffer.
10502 void eicpy_raw (Eistring *eistr, const Ibyte *data, Bytecount len);
10503 ... from raw internal-format data in the default internal format.
10504 void eicpy_rawz (Eistring *eistr, const Ibyte *data);
10505 ... from raw internal-format data in the default internal format
10506 that is "null-terminated" (the meaning of this depends on the nature
10507 of the default internal format).
10508 void eicpy_raw_fmt (Eistring *eistr, const Ibyte *data, Bytecount len,
10509 Internal_Format intfmt, Lisp_Object object);
10510 ... from raw internal-format data in the specified format.
10511 void eicpy_rawz_fmt (Eistring *eistr, const Ibyte *data,
10512 Internal_Format intfmt, Lisp_Object object);
10513 ... from raw internal-format data in the specified format that is
10514 "null-terminated" (the meaning of this depends on the nature of
10515 the specific format).
10516 void eicpy_c (Eistring *eistr, const Ascbyte *c_string);
10517 ... from an ASCII null-terminated string. Non-ASCII characters in
10518 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
10519 void eicpy_c_len (Eistring *eistr, const Ascbyte *c_string, len);
10520 ... from an ASCII string, with length specified. Non-ASCII characters
10521 in the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
10522 void eicpy_ext (Eistring *eistr, const Extbyte *extdata,
10523 Lisp_Object codesys);
10524 ... from external null-terminated data, with coding system specified.
10525 void eicpy_ext_len (Eistring *eistr, const Extbyte *extdata,
10526 Bytecount extlen, Lisp_Object codesys);
10527 ... from external data, with length and coding system specified.
10528 void eicpy_lstream (Eistring *eistr, Lisp_Object lstream);
10529 ... from an lstream; reads data till eof. Data must be in default
10530 internal format; otherwise, interpose a decoding lstream.
10531
10532
10533 **********************************************
10534 * Getting the data out of the Eistring *
10535 **********************************************
10536
10537 Ibyte *eidata (Eistring *eistr);
10538 Return a pointer to the raw data in an Eistring. This is NOT
10539 a copy.
10540
10541 Lisp_Object eimake_string (Eistring *eistr);
10542 Make a Lisp string out of the Eistring.
10543
10544 Lisp_Object eimake_string_off (Eistring *eistr,
10545 Bytecount off, Charcount charoff,
10546 Bytecount len, Charcount charlen);
10547 Make a Lisp string out of a section of the Eistring.
10548
10549 void eicpyout_alloca (Eistring *eistr, LVALUE: Ibyte *ptr_out,
10550 LVALUE: Bytecount len_out);
10551 Make an @code{ALLOCA()} copy of the data in the Eistring, using the
10552 default internal format. Due to the nature of @code{ALLOCA()}, this
10553 must be a macro, with all lvalues passed in as parameters.
10554 (More specifically, not all compilers correctly handle using
10555 @code{ALLOCA()} as the argument to a function call -- GCC on x86
10556 didn't use to, for example.) A pointer to the @code{ALLOCA()}ed data
10557 is stored in PTR_OUT, and the length of the data (not including
10558 the terminating zero) is stored in LEN_OUT.
10559
10560 void eicpyout_alloca_fmt (Eistring *eistr, LVALUE: Ibyte *ptr_out,
10561 LVALUE: Bytecount len_out,
10562 Internal_Format intfmt, Lisp_Object object);
10563 Like @code{eicpyout_alloca()}, but converts to the specified internal
10564 format. (No formats other than FORMAT_DEFAULT are currently
10565 implemented, and you get an assertion failure if you try.)
10566
10567 Ibyte *eicpyout_malloc (Eistring *eistr, Bytecount *intlen_out);
10568 Make a @code{malloc()} copy of the data in the Eistring, using the
10569 default internal format. This is a real function. No lvalues
10570 passed in. Returns the new data, and stores the length (not
10571 including the terminating zero) using INTLEN_OUT, unless it's
10572 a NULL pointer.
10573
10574 Ibyte *eicpyout_malloc_fmt (Eistring *eistr, Internal_Format intfmt,
10575 Bytecount *intlen_out, Lisp_Object object);
10576 Like @code{eicpyout_malloc()}, but converts to the specified internal
10577 format. (No formats other than FORMAT_DEFAULT are currently
10578 implemented, and you get an assertion failure if you try.)
10579
10580
10581 **********************************************
10582 * Moving to the heap *
10583 **********************************************
10584
10585 void eito_malloc (Eistring *eistr);
10586 Move this Eistring to the heap. Its data will be stored in a
10587 @code{malloc()}ed block rather than the stack. Subsequent changes to
10588 this Eistring will @code{realloc()} the block as necessary. Use this
10589 when you want the Eistring to remain in scope past the end of
10590 this function call. You will have to manually free the data
10591 in the Eistring using @code{eifree()}.
10592
10593 void eito_alloca (Eistring *eistr);
10594 Move this Eistring back to the stack, if it was moved to the
10595 heap with @code{eito_malloc()}. This will automatically free any
10596 heap-allocated data.
10597
10598
10599
10600 **********************************************
10601 * Retrieving the length *
10602 **********************************************
10603
10604 Bytecount eilen (Eistring *eistr);
10605 Return the length of the internal data, in bytes. See also
10606 @code{eiextlen()}, below.
10607 Charcount eicharlen (Eistring *eistr);
10608 Return the length of the internal data, in characters.
10609
10610
10611 **********************************************
10612 * Working with positions *
10613 **********************************************
10614
10615 Bytecount eicharpos_to_bytepos (Eistring *eistr, Charcount charpos);
10616 Convert a char offset to a byte offset.
10617 Charcount eibytepos_to_charpos (Eistring *eistr, Bytecount bytepos);
10618 Convert a byte offset to a char offset.
10619 Bytecount eiincpos (Eistring *eistr, Bytecount bytepos);
10620 Increment the given position by one character.
10621 Bytecount eiincpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
10622 Increment the given position by N characters.
10623 Bytecount eidecpos (Eistring *eistr, Bytecount bytepos);
10624 Decrement the given position by one character.
10625 Bytecount eidecpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
10626 Decrement the given position by N characters.
10627
10628
10629 **********************************************
10630 * Getting the character at a position *
10631 **********************************************
10632
10633 Ichar eigetch (Eistring *eistr, Bytecount bytepos);
10634 Return the character at a particular byte offset.
10635 Ichar eigetch_char (Eistring *eistr, Charcount charpos);
10636 Return the character at a particular character offset.
10637
10638
10639 **********************************************
10640 * Setting the character at a position *
10641 **********************************************
10642
10643 Ichar eisetch (Eistring *eistr, Bytecount bytepos, Ichar chr);
10644 Set the character at a particular byte offset.
10645 Ichar eisetch_char (Eistring *eistr, Charcount charpos, Ichar chr);
10646 Set the character at a particular character offset.
10647
10648
10649 **********************************************
10650 * Concatenation *
10651 **********************************************
10652
10653 void eicat_* (Eistring *eistr, ...);
10654 Concatenate onto the end of the Eistring, with data coming from the
10655 same places as above:
10656
10657 void eicat_ei (Eistring *eistr, Eistring *eistr2);
10658 ... from another Eistring.
10659 void eicat_c (Eistring *eistr, Ascbyte *c_string);
10660 ... from an ASCII null-terminated string. Non-ASCII characters in
10661 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
10662 void eicat_raw (Eistring *eistr, const Ibyte *data, Bytecount len);
10663 ... from raw internal-format data in the default internal format.
10664 void eicat_rawz (Eistring *eistr, const Ibyte *data);
10665 ... from raw internal-format data in the default internal format
10666 that is "null-terminated" (the meaning of this depends on the nature
10667 of the default internal format).
10668 void eicat_lstr (Eistring *eistr, Lisp_Object lisp_string);
10669 ... from a Lisp_Object string.
10670 void eicat_ch (Eistring *eistr, Ichar ch);
10671 ... from an Ichar.
10672
10673 (All except the first variety are convenience functions.
10674 In the general case, create another Eistring from the source.)
10675
10676
10677 **********************************************
10678 * Replacement *
10679 **********************************************
10680
10681 void eisub_* (Eistring *eistr, Bytecount off, Charcount charoff,
10682 Bytecount len, Charcount charlen, ...);
10683 Replace a section of the Eistring, specifically:
10684
10685 void eisub_ei (Eistring *eistr, Bytecount off, Charcount charoff,
10686 Bytecount len, Charcount charlen, Eistring *eistr2);
10687 ... with another Eistring.
10688 void eisub_c (Eistring *eistr, Bytecount off, Charcount charoff,
10689 Bytecount len, Charcount charlen, Ascbyte *c_string);
10690 ... with an ASCII null-terminated string. Non-ASCII characters in
10691 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
10692 void eisub_ch (Eistring *eistr, Bytecount off, Charcount charoff,
10693 Bytecount len, Charcount charlen, Ichar ch);
10694 ... with an Ichar.
10695
10696 void eidel (Eistring *eistr, Bytecount off, Charcount charoff,
10697 Bytecount len, Charcount charlen);
10698 Delete a section of the Eistring.
10699
10700
10701 **********************************************
10702 * Converting to an external format *
10703 **********************************************
10704
10705 void eito_external (Eistring *eistr, Lisp_Object codesys);
10706 Convert the Eistring to an external format and store the result
10707 in the string. NOTE: Further changes to the Eistring will @strong{NOT}
10708 change the external data stored in the string. You will have to
10709 call @code{eito_external()} again in such a case if you want the external
10710 data.
10711
10712 Extbyte *eiextdata (Eistring *eistr);
10713 Return a pointer to the external data stored in the Eistring as
10714 a result of a prior call to @code{eito_external()}.
10715
10716 Bytecount eiextlen (Eistring *eistr);
10717 Return the length in bytes of the external data stored in the
10718 Eistring as a result of a prior call to @code{eito_external()}.
10719
10720
10721 **********************************************
10722 * Searching in the Eistring for a character *
10723 **********************************************
10724
10725 Bytecount eichr (Eistring *eistr, Ichar chr);
10726 Charcount eichr_char (Eistring *eistr, Ichar chr);
10727 Bytecount eichr_off (Eistring *eistr, Ichar chr, Bytecount off,
10728 Charcount charoff);
10729 Charcount eichr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
10730 Charcount charoff);
10731 Bytecount eirchr (Eistring *eistr, Ichar chr);
10732 Charcount eirchr_char (Eistring *eistr, Ichar chr);
10733 Bytecount eirchr_off (Eistring *eistr, Ichar chr, Bytecount off,
10734 Charcount charoff);
10735 Charcount eirchr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
10736 Charcount charoff);
10737
10738
10739 **********************************************
10740 * Searching in the Eistring for a string *
10741 **********************************************
10742
10743 Bytecount eistr_ei (Eistring *eistr, Eistring *eistr2);
10744 Charcount eistr_ei_char (Eistring *eistr, Eistring *eistr2);
10745 Bytecount eistr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
10746 Charcount charoff);
10747 Charcount eistr_ei_off_char (Eistring *eistr, Eistring *eistr2,
10748 Bytecount off, Charcount charoff);
10749 Bytecount eirstr_ei (Eistring *eistr, Eistring *eistr2);
10750 Charcount eirstr_ei_char (Eistring *eistr, Eistring *eistr2);
10751 Bytecount eirstr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
10752 Charcount charoff);
10753 Charcount eirstr_ei_off_char (Eistring *eistr, Eistring *eistr2,
10754 Bytecount off, Charcount charoff);
10755
10756 Bytecount eistr_c (Eistring *eistr, Ascbyte *c_string);
10757 Charcount eistr_c_char (Eistring *eistr, Ascbyte *c_string);
10758 Bytecount eistr_c_off (Eistring *eistr, Ascbyte *c_string, Bytecount off,
10759 Charcount charoff);
10760 Charcount eistr_c_off_char (Eistring *eistr, Ascbyte *c_string,
10761 Bytecount off, Charcount charoff);
10762 Bytecount eirstr_c (Eistring *eistr, Ascbyte *c_string);
10763 Charcount eirstr_c_char (Eistring *eistr, Ascbyte *c_string);
10764 Bytecount eirstr_c_off (Eistring *eistr, Ascbyte *c_string,
10765 Bytecount off, Charcount charoff);
10766 Charcount eirstr_c_off_char (Eistring *eistr, Ascbyte *c_string,
10767 Bytecount off, Charcount charoff);
10768
10769
10770 **********************************************
10771 * Comparison *
10772 **********************************************
10773
10774 int eicmp_* (Eistring *eistr, ...);
10775 int eicmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
10776 Bytecount len, Charcount charlen, ...);
10777 int eicasecmp_* (Eistring *eistr, ...);
10778 int eicasecmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
10779 Bytecount len, Charcount charlen, ...);
10780 int eicasecmp_i18n_* (Eistring *eistr, ...);
10781 int eicasecmp_i18n_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
10782 Bytecount len, Charcount charlen, ...);
10783
10784 Compare the Eistring with the other data. Return value same as
10785 from strcmp. The @code{*} is either @code{ei} for another Eistring (in
10786 which case @code{...} is an Eistring), or @code{c} for a pure-ASCII string
10787 (in which case @code{...} is a pointer to that string). For anything
10788 more complex, first create an Eistring out of the source.
10789 Comparison is either simple (@code{eicmp_...}), ASCII case-folding
10790 (@code{eicasecmp_...}), or multilingual case-folding
10791 (@code{eicasecmp_i18n_...}).
10792
10793
10794 More specifically, the prototypes are:
10795
10796 int eicmp_ei (Eistring *eistr, Eistring *eistr2);
10797 int eicmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff,
10798 Bytecount len, Charcount charlen, Eistring *eistr2);
10799 int eicasecmp_ei (Eistring *eistr, Eistring *eistr2);
10800 int eicasecmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff,
10801 Bytecount len, Charcount charlen, Eistring *eistr2);
10802 int eicasecmp_i18n_ei (Eistring *eistr, Eistring *eistr2);
10803 int eicasecmp_i18n_off_ei (Eistring *eistr, Bytecount off,
10804 Charcount charoff, Bytecount len,
10805 Charcount charlen, Eistring *eistr2);
10806
10807 int eicmp_c (Eistring *eistr, Ascbyte *c_string);
10808 int eicmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
10809 Bytecount len, Charcount charlen, Ascbyte *c_string);
10810 int eicasecmp_c (Eistring *eistr, Ascbyte *c_string);
10811 int eicasecmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
10812 Bytecount len, Charcount charlen,
10813 Ascbyte *c_string);
10814 int eicasecmp_i18n_c (Eistring *eistr, Ascbyte *c_string);
10815 int eicasecmp_i18n_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
10816 Bytecount len, Charcount charlen,
10817 Ascbyte *c_string);
10818
10819
10820 **********************************************
10821 * Case-changing the Eistring *
10822 **********************************************
10823
10824 void eilwr (Eistring *eistr);
10825 Convert all characters in the Eistring to lowercase.
10826 void eiupr (Eistring *eistr);
10827 Convert all characters in the Eistring to uppercase.
10828 @end example
10829
10830 @node Coding for Mule, CCL, Internal Text API's, Multilingual Support
10831 @section Coding for Mule
10832 @cindex coding for Mule
10833 @cindex Mule, coding for
10834
10835 Although Mule support is not compiled by default in XEmacs, many people
10836 are using it, and we consider it crucial that new code works correctly
10837 with multibyte characters. This is not hard; it is only a matter of
10838 following several simple coding guidelines. Even if you never
10839 compile with Mule, with a little practice you will find it quite easy
10840 to code Mule-correctly.
10841
10842 Note that these guidelines are not necessarily tied to the current Mule
10843 implementation; they are also a good idea to follow on the grounds of
10844 code generalization for future I18N work.
10845
10846 @menu
10847 * Character-Related Data Types::
10848 * Working With Character and Byte Positions::
10849 * Conversion to and from External Data::
10850 * General Guidelines for Writing Mule-Aware Code::
10851 * An Example of Mule-Aware Code::
10852 * Mule-izing Code::
10853 @end menu
10854
10855 @node Character-Related Data Types, Working With Character and Byte Positions, Coding for Mule, Coding for Mule
10856 @subsection Character-Related Data Types
10857 @cindex character-related data types
10858 @cindex data types, character-related
10859
10860 First, let's review the basic character-related datatypes used by
10861 XEmacs. Note that some of the separate @code{typedef}s are not
10862 mandatory, but they improve clarity of code a great deal, because one
10863 glance at the declaration can tell the intended use of the variable.
10864
10865 @table @code
10866 @item Ichar
10867 @cindex Ichar
10868 An @code{Ichar} holds a single Emacs character.
10869
10870 Obviously, the equality between characters and bytes is lost in the Mule
10871 world. Characters can be represented by one or more bytes in the
10872 buffer, and @code{Ichar} is a C type large enough to hold any
10873 character. (This currently isn't quite true for ISO 10646, which
10874 defines a character as a 31-bit non-negative quantity, while XEmacs
10875 characters are only 30 bits. This is irrelevant, unless you are
10876 considering using the ISO 10646 private groups to support really large
10877 private character sets---in particular, the Mule character set!---in
10878 a version of XEmacs using Unicode internally.)
10879
10880 Without Mule support, an @code{Ichar} is equivalent to an
10881 @code{unsigned char}. [[This doesn't seem to be true; @file{lisp.h}
10882 unconditionally @samp{typedef}s @code{Ichar} to @code{int}.]]
10883
10884 @item Ibyte
10885 @cindex Ibyte
10886 The data representing the text in a buffer or string is logically a set
10887 of @code{Ibyte}s.
10888
10889 XEmacs does not work with the same character formats all the time; when
10890 reading characters from the outside, it decodes them to an internal
10891 format, and likewise encodes them when writing. @code{Ibyte} (in fact
10892 @code{unsigned char}) is the basic unit of the internal format of XEmacs
10893 buffers and strings. An @code{Ibyte *} is the type that points at text
10894 encoded in the variable-width internal encoding.
10895
10896 One character can correspond to one or more @code{Ibyte}s. In the
10897 current Mule implementation, an ASCII character is represented by a
10898 single @code{Ibyte} of the same value, and other characters are represented by a sequence
10899 of two or more @code{Ibyte}s. (This will also be true of an
10900 implementation using UTF-8 as the internal encoding. In fact, only code
10901 that implements character code conversions and a very few macros used to
10902 implement motion by whole characters will notice the difference between
10903 UTF-8 and the Mule encoding.)
10904
10905 Without Mule support, there are exactly 256 characters, implicitly
10906 Latin-1, and each character is represented using one @code{Ibyte}, and
10907 there is a one-to-one correspondence between @code{Ibyte}s and
10908 @code{Ichar}s.
10909
10910 @item Charxpos
10911 @itemx Charbpos
10912 @itemx Charcount
10913 @cindex Charxpos
10914 @cindex Charbpos
10915 @cindex Charcount
10916 A @code{Charbpos} represents a character position in a buffer. A
10917 @code{Charcount} represents a number (count) of characters. Logically,
10918 subtracting two @code{Charbpos} values yields a @code{Charcount} value.
10919 When representing a character position in a string, we just use
10920 @code{Charcount} directly. The reason for having a separate typedef for
10921 buffer positions is that they are 1-based, whereas string positions are
10922 0-based and hence string counts and positions can be freely intermixed (a
10923 string position is equivalent to the count of characters from the
10924 beginning). When representing a character position that could be either
10925 in a buffer or string (for example, in the extent code), @code{Charxpos}
10926 is used. Although all of these are @code{typedef}ed to
10927 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make
10928 it clear what sort of position is being used.
10929
10930 @code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the
10931 only ones that are ever visible to Lisp.
10932
10933 @item Bytexpos
10934 @itemx Bytecount
10935 @cindex Bytebpos
10936 @cindex Bytecount
10937 A @code{Bytebpos} represents a byte position in a buffer. A
10938 @code{Bytecount} represents the distance between two positions, in
10939 bytes. Byte positions in strings use @code{Bytecount}, and for byte
10940 positions that can be either in a buffer or string, @code{Bytexpos} is
10941 used. The relationship between @code{Bytexpos}, @code{Bytebpos} and
10942 @code{Bytecount} is the same as the relationship between
10943 @code{Charxpos}, @code{Charbpos} and @code{Charcount}.
10944
10945 @item Extbyte
10946 @cindex Extbyte
10947 When dealing with the outside world, XEmacs works with @code{Extbyte}s,
10948 which are equivalent to @code{char}. The distance between two
10949 @code{Extbyte}s is a @code{Bytecount}, since external text is a
10950 byte-by-byte encoding. Extbytes occur mainly at the transition point
10951 between internal text and external functions. XEmacs code should not,
10952 if it can possibly avoid it, do any actual manipulation using external
10953 text, since its format is completely unpredictable (it might not even be
10954 ASCII-compatible).
10955 @end table
10956
10957 @node Working With Character and Byte Positions, Conversion to and from External Data, Character-Related Data Types, Coding for Mule
10958 @subsection Working With Character and Byte Positions
10959 @cindex character and byte positions, working with
10960 @cindex byte positions, working with character and
10961 @cindex positions, working with character and byte
10962
10963 Now that we have defined the basic character-related types, we can look
10964 at the macros and functions designed for working with them and for
10965 conversion between them. Most of these macros are defined in
10966 @file{buffer.h}, and we don't discuss all of them here, but only the
10967 most important ones. Examining the existing code is the best way to
10968 learn about them.
10969
10970 @table @code
10971 @item MAX_ICHAR_LEN
10972 @cindex MAX_ICHAR_LEN
10973 This preprocessor constant is the maximum number of buffer bytes needed to
10974 represent an Emacs character in the variable-width internal encoding.
10975 It is useful when allocating temporary strings to hold a known number of
10976 characters. For instance:
10977
10978 @example
10979 @group
10980 @{
10981 Charcount cclen;
10982 ...
10983 @{
10984 /* Allocate place for @var{cclen} characters. */
10985 Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN);
10986 ...
10987 @end group
10988 @end example
10989
10990 If you followed the previous section, you can guess that, logically,
10991 multiplying a @code{Charcount} value by @code{MAX_ICHAR_LEN} produces
10992 a @code{Bytecount} value.
10993
10994 In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4.
10995 Without Mule, it is 1. In a mature Unicode-based XEmacs, it will also
10996 be 4 (since all Unicode characters can be encoded in UTF-8 in 4 bytes or
10997 fewer), but some versions may use up to 6, in order to use the large
10998 private space provided by ISO 10646 to ``mirror'' the Mule code space.
10999
11000 @item itext_ichar
11001 @itemx set_itext_ichar
11002 @cindex itext_ichar
11003 @cindex set_itext_ichar
11004 The @code{itext_ichar} macro takes an @code{Ibyte} pointer and
11005 returns the @code{Ichar} stored at that position. If it were a
11006 function, its prototype would be:
11007
11008 @example
11009 Ichar itext_ichar (Ibyte *p);
11010 @end example
11011
11012 @code{set_itext_ichar} stores an @code{Ichar} to the specified byte
11013 position. It returns the number of bytes stored:
11014
11015 @example
11016 Bytecount set_itext_ichar (Ibyte *p, Ichar c);
11017 @end example
11018
11019 It is important to note that @code{set_itext_ichar} is safe only for
11020 appending a character at the end of a buffer, not for overwriting a
11021 character in the middle. This is because the width of characters
11022 varies, and @code{set_itext_ichar} cannot resize the string if it
11023 writes, say, a two-byte character where a single-byte character used to
11024 reside.
11025
11026 A typical use of @code{set_itext_ichar} can be demonstrated by this
11027 example, which copies characters from buffer @var{buf} to a temporary
11028 string of Ibytes.
11029
11030 @example
11031 @group
11032 @{
11033 Charbpos pos;
11034 for (pos = beg; pos < end; pos++)
11035 @{
11036 Ichar c = BUF_FETCH_CHAR (buf, pos);
11037 p += set_itext_ichar (p, c);
11038 @}
11039 @}
11040 @end group
11041 @end example
11042
11043 Note how @code{set_itext_ichar} is used to store the @code{Ichar}
11044 and advance the pointer @code{p} at the same time.
11045
11046 @item INC_IBYTEPTR
11047 @itemx DEC_IBYTEPTR
11048 @cindex INC_IBYTEPTR
11049 @cindex DEC_IBYTEPTR
11050 These two macros increment and decrement an @code{Ibyte} pointer,
11051 respectively. They will adjust the pointer by the appropriate number of
11052 bytes according to the byte length of the character stored there. Both
11053 macros assume that the memory address is located at the beginning of a
11054 valid character.
11055
11056 Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)}
11057 simply expand to @code{p++} and @code{p--}, respectively.
11058
11059 @item bytecount_to_charcount
11060 @cindex bytecount_to_charcount
11061 Given a pointer to a text string and a length in bytes, return the
11062 equivalent length in characters.
11063
11064 @example
11065 Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc);
11066 @end example
11067
11068 @item charcount_to_bytecount
11069 @cindex charcount_to_bytecount
11070 Given a pointer to a text string and a length in characters, return the
11071 equivalent length in bytes.
11072
11073 @example
11074 Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc);
11075 @end example
11076
11077 @item itext_n_addr
11078 @cindex itext_n_addr
11079 Return a pointer to the beginning of the character offset @var{cc} (in
11080 characters) from @var{p}.
11081
11082 @example
11083 Ibyte *itext_n_addr (Ibyte *p, Charcount cc);
11084 @end example
11085 @end table
11086
11087 @node Conversion to and from External Data, General Guidelines for Writing Mule-Aware Code, Working With Character and Byte Positions, Coding for Mule
11088 @subsection Conversion to and from External Data
11089 @cindex conversion to and from external data
11090 @cindex external data, conversion to and from
11091
11092 When an external function, such as a C library function, returns a
11093 @code{char} pointer, you should almost never treat it as @code{Ibyte}.
11094 This is because these returned strings may contain 8bit characters which
11095 can be misinterpreted by XEmacs, and cause a crash. Likewise, when
11096 exporting a piece of internal text to the outside world, you should
11097 always convert it to an appropriate external encoding, lest the internal
11098 stuff (such as the infamous \201 characters) leak out.
11099
11100 The interface for converting between the internal and external
11101 representations of text is the set of conversion macros defined in
11102 @file{buffer.h}. There used to be a fixed set of external formats
11103 supported by these macros, but now any coding system can be used with
11104 them. The coding system alias mechanism is used to create the
11105 following logical coding systems, which replace the fixed external
11106 formats. The @code{dontusethis-set-symbol-value-handler} mechanism was
11107 enhanced to make this possible (more work on that is needed).
11108
11109 Often useful coding systems:
11110
11111 @table @code
11112 @item Qbinary
11113 This is the simplest format and is what we use in the absence of a more
11114 appropriate format. This converts according to the @code{binary} coding
11115 system:
11116
11117 @enumerate a
11118 @item
11119 On input, bytes 0--255 are converted into (implicitly Latin-1)
11120 characters 0--255. A non-Mule XEmacs doesn't really know about
11121 different character sets and the fonts to display them, so the bytes can
11122 be treated as text in different 1-byte encodings by simply setting the
11123 appropriate fonts. So in a sense, non-Mule XEmacs is a multi-lingual
11124 editor if, for example, different fonts are used to display text in
11125 different buffers, faces, or windows. The specifier mechanism gives the
11126 user complete control over this kind of behavior.
11127 @item
11128 On output, characters 0--255 are converted into bytes 0--255 and other
11129 characters are converted into @samp{~}.
11130 @end enumerate
11131
11132 @item Qnative
11133 Format used for the external Unix environment---@code{argv[]}, stuff
11134 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc.
11135 This is encoded according to the encoding specified by the current locale.
11136 [[This is dangerous; current locale is user preference, and the system
11137 is probably going to be something else. Is there anything we can do
11138 about it?]]
11139
11140 @item Qfile_name
11141 Format used for filenames. This is normally the same as @code{Qnative},
11142 but the two should be distinguished for clarity and possible future
11143 separation -- and also because @code{Qfile_name} can be changed using either
11144 the @code{file-name-coding-system} or @code{pathname-coding-system} (now
11145 obsolete) variables.
11146
11147 @item Qctext
11148 Compound-text format. This is the standard X11 format used for data
11149 stored in properties, selections, and the like. This is an 8-bit
11150 no-lock-shift ISO2022 coding system. This is a real coding system,
11151 unlike @code{Qfile_name}, which is user-definable.
11152
11153 @item Qmswindows_tstr
11154 Used for external data in all MS Windows functions that are declared to
11155 accept data of type @code{LPTSTR} or @code{LPCTSTR}. This maps to either
11156 @code{Qmswindows_multibyte} (a locale-specific encoding, same as
11157 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether
11158 XEmacs is being run under Windows 9X or Windows NT/2000/XP.
11159 @end table
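The two conversion rules listed for @code{Qbinary} are simple enough to sketch directly. This is a minimal illustration of the rules as stated above, not the XEmacs coding-system code; characters are represented as plain code points and all function names are hypothetical.

```c
#include <stddef.h>

/* Output rule: characters 0-255 become the byte with the same value;
   every other character becomes '~'. */
unsigned char
sketch_binary_encode_char (unsigned int ch)
{
  return ch <= 255 ? (unsigned char) ch : (unsigned char) '~';
}

/* Input rule: each byte becomes the character with the same number. */
unsigned int
sketch_binary_decode_byte (unsigned char byte)
{
  return (unsigned int) byte;
}

/* Encode N characters into OUT, which must have room for N bytes. */
void
sketch_binary_encode (const unsigned int *chars, size_t n,
                      unsigned char *out)
{
  size_t i;

  for (i = 0; i < n; i++)
    out[i] = sketch_binary_encode_char (chars[i]);
}
```

Note that the output direction loses information: any character above 255 comes out as the same @samp{~} byte, so a round trip through this format is only safe for text limited to characters 0--255.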
11160
11161 Many other coding systems are provided by default.
11162
11163 There are two fundamental macros to convert between external and
11164 internal format, as well as various convenience macros to simplify the
11165 most common operations.
11166
11167 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and
11168 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments
11169 each of these receives are a source type, a source, a sink type, a sink,
11170 and a coding system (or a symbol naming a coding system).
11171
11172 A typical call looks like
11173 @example
11174 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);
11175 @end example
11176
11177 which means that the contents of the Lisp string @code{str} are written
11178 to a malloc'ed memory area, which @code{ptr} points to once the
11179 macro has completed. The conversion will be done using the
11180 @code{file-name} coding system, which will be controlled by the user
11181 indirectly by setting or binding the variable
11182 @code{file-name-coding-system}.
11183
11184 Some sources and sinks require two C variables to specify. We use some
11185 preprocessor magic to allow different source and sink types, and even
11186 different numbers of arguments to specify different types of sources and
11187 sinks.
11188
11189 So we can have a call that looks like
11190 @example
11191 TO_INTERNAL_FORMAT (DATA, (ptr, len),
11192                     MALLOC, (ptr, len),
11193                     coding_system);
11194 @end example
11195
11196 The parenthesized argument pairs are required to make the preprocessor
11197 magic work.
11198
11199 Here are the different source and sink types:
11200
11201 @table @code
11202 @item @code{DATA, (ptr, len),}
11203 input data is a fixed buffer of size @var{len} at address @var{ptr}
11204 @item @code{ALLOCA, (ptr, len),}
11205 output data is placed in an @code{alloca()}ed buffer of size @var{len} pointed to by @var{ptr}
11206 @item @code{MALLOC, (ptr, len),}
11207 output data is in a @code{malloc()}ed buffer of size @var{len} pointed to by @var{ptr}
11208 @item @code{C_STRING_ALLOCA, ptr,}
11209 equivalent to @code{ALLOCA (ptr, len_ignored)} on output.
11210 @item @code{C_STRING_MALLOC, ptr,}
11211 equivalent to @code{MALLOC (ptr, len_ignored)} on output
11212 @item @code{C_STRING, ptr,}
11213 equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input
11214 @item @code{LISP_STRING, string,}
11215 input or output is a Lisp_Object of type string
11216 @item @code{LISP_BUFFER, buffer,}
11217 output is written to @code{(point)} in lisp buffer @var{buffer}
11218 @item @code{LISP_LSTREAM, lstream,}
11219 input or output is a Lisp_Object of type lstream
11220 @item @code{LISP_OPAQUE, object,}
11221 input or output is a Lisp_Object of type opaque
11222 @end table
11223
11224 A source type of @code{C_STRING} or a sink type of
11225 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where
11226 the external API is not '\0'-byte-clean -- i.e. it expects strings to be
11227 terminated with a null byte. For external APIs that are in fact
11228 '\0'-byte-clean, we should of course not use these.
11229
11230 The sinks to be specified must be lvalues, unless they are the lisp
11231 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}.
11232
11233 There is no problem using the same lvalue for source and sink.
11234
11235 Garbage collection is inhibited during these conversion operations, so
11236 it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}.
11237
11238 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the
11239 resulting text is stored in a stack-allocated buffer, which is
11240 automatically freed on returning from the function. However, the sink
11241 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed
11242 memory. The caller is responsible for freeing this memory using
11243 @code{xfree()}.
11244
11245 Note that it doesn't make sense for @code{LISP_STRING} to be a source
11246 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}.
11247 You'll get an assertion failure if you try.
11248
11249 99% of conversions involve raw data or Lisp strings as both source and
11250 sink, and the output usually goes into @code{alloca()}ed memory, or sometimes
11251 @code{xmalloc()}ed memory. For this reason, convenience macros are defined for
11252 many types of conversions involving raw data and/or Lisp strings,
11253 especially when the output is an @code{alloca()}ed string. (When the
11254 destination is a Lisp string, there are other functions that should be
11255 used instead -- @code{build_ext_string()} and @code{make_ext_string()},
11256 for example.) The convenience macros are of two types -- the older kind
11257 that store the result into a specified variable, and the newer kind that
11258 return the result. The newer kind of macros don't exist when the output
11259 is sized data, because that would have two return values. NOTE: All
11260 convenience macros are ultimately defined in terms of
11261 @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}. Thus, any
11262 comments above about the workings of these macros also apply to all
11263 convenience macros.
11264
11265 A typical old-style convenience macro is
11266
11267 @example
11268 C_STRING_TO_EXTERNAL (in, out, codesys);
11269 @end example
11270
11271 This is equivalent to
11272
11273 @example
11274 TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys);
11275 @end example
11276
11277 but is easier to write and somewhat clearer, since it clearly identifies
11278 the arguments without the clutter of having the preprocessor types mixed
11279 in.
11280
11281 The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src,
11282 codesys)}, which @emph{returns} the converted data (still in
11283 @code{alloca()} space). This is far more convenient for most
11284 operations.
11285
11286 @node General Guidelines for Writing Mule-Aware Code, An Example of Mule-Aware Code, Conversion to and from External Data, Coding for Mule
11287 @subsection General Guidelines for Writing Mule-Aware Code
11288 @cindex writing Mule-aware code, general guidelines for
11289 @cindex Mule-aware code, general guidelines for writing
11290 @cindex code, general guidelines for writing Mule-aware
11291
11292 This section contains some general guidance on how to write Mule-aware
11293 code, as well as some pitfalls you should avoid.
11294
11295 @table @emph
11296 @item Never use @code{char} and @code{char *}.
11297 In XEmacs, the use of @code{char} and @code{char *} is almost always a
11298 mistake. If you want to manipulate an Emacs character from ``C'', use
11299 @code{Ichar}. If you want to examine a specific octet in the internal
11300 format, use @code{Ibyte}. If you want a Lisp-visible character, use a
11301 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move
11302 through the internal text, use @code{Ibyte *}. Also note that you
11303 almost certainly do not need @code{Ichar *}. Other typedefs to clarify
11304 the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary},
11305 @code{UChar_Binary}, and @code{CIbyte}.
11306
11307 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}.
11308 The whole point of using different types is to avoid confusion about the
11309 use of certain variables. Lest this effect be nullified, you need to be
11310 careful about using the right types.
11311
11312 @item Always convert external data
11313 It is extremely important to always convert external data, because
11314 XEmacs can crash if unexpected 8-bit sequences are copied to its internal
11315 buffers literally.
11316
11317 This means that when a system function, such as @code{readdir}, returns
11318 a string, you normally need to convert it using one of the conversion macros
11319 described in the previous section, before passing it further to Lisp.
11320
11321 Actually, most of the basic system functions that accept '\0'-terminated
11322 string arguments, like @code{stat()} and @code{open()}, have
11323 @strong{encapsulated} equivalents that do the internal to external
11324 conversion themselves. The encapsulated equivalents have a @code{qxe_}
11325 prefix and have string arguments of type @code{Ibyte *}, and you can
11326 pass internally encoded data to them, often from a Lisp string using
11327 @code{XSTRING_DATA}. (A better design might be to provide versions that
11328 accept Lisp strings directly.) [[Really? Then they'd either take
11329 @code{Lisp_Object}s and need to check type, or they'd take
11330 @code{Lisp_String}s, and violate the rules about passing any of the
11331 specific Lisp types.]]
11332
11333 Also note that many internal functions, such as @code{make_string},
11334 accept Ibytes, which removes the need for them to convert the data they
11335 receive. This increases efficiency because that way external data needs
11336 to be decoded only once, when it is read. After that, it is passed
11337 around in internal format.
11338
11339 @item Do all work in internal format
11340 External-formatted data is completely unpredictable in its format. It
11341 may be fixed-width Unicode (not even ASCII compatible); it may be a
11342 modal encoding, in
11343 which case some occurrences of (e.g.) the slash character may be part of
11344 two-byte Asian-language characters, and a naive attempt to split apart a
11345 pathname by slashes will fail; etc. Internal-format text should be
11346 converted to external format only at the point where an external API is
11347 actually called, and the first thing done after receiving
11348 external-format text from an external API should be to convert it to
11349 internal text.
11350 @end table
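The danger of splitting external-format text on a separator byte can be made concrete. The sketch below assumes a path encoded in Windows code page 932 (Shift-JIS), where the byte 0x5C (backslash, the Windows path separator) can also occur as the second byte of a two-byte character. The lead-byte ranges 0x81--0x9F and 0xE0--0xFC are those of code page 932; the function names are hypothetical and not part of XEmacs.

```c
#include <stddef.h>

/* Lead-byte test for code page 932: 0x81-0x9F and 0xE0-0xFC. */
static int
sketch_is_cp932_lead_byte (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive search: index of the last 0x5C byte in S, or -1 if none.
   Wrong for code page 932, because 0x5C may be a trail byte. */
long
sketch_last_sep_naive (const unsigned char *s, size_t len)
{
  long found = -1;
  size_t i;

  for (i = 0; i < len; i++)
    if (s[i] == '\\')
      found = (long) i;
  return found;
}

/* MBCS-aware search: step over each two-byte character as a unit, so
   that a trail byte is never mistaken for a separator. */
long
sketch_last_sep_cp932 (const unsigned char *s, size_t len)
{
  long found = -1;
  size_t i = 0;

  while (i < len)
    {
      if (sketch_is_cp932_lead_byte (s[i]) && i + 1 < len)
        i += 2;                 /* skip lead byte and trail byte */
      else
        {
          if (s[i] == '\\')
            found = (long) i;
          i++;
        }
    }
  return found;
}
```

For the six bytes @samp{C : \ 0x95 0x5C a} (a directory separator followed by a two-byte character whose second byte is 0x5C), the naive search reports a separator inside the character, while the MBCS-aware search reports the real one. Working in internal format avoids the problem altogether, since there a separator byte can never be part of another character.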
11351
11352 @node An Example of Mule-Aware Code, Mule-izing Code, General Guidelines for Writing Mule-Aware Code, Coding for Mule
11353 @subsection An Example of Mule-Aware Code
11354 @cindex code, an example of Mule-aware
11355 @cindex Mule-aware code, an example of
11356
11357 As an example of Mule-aware code, we will analyze the @code{string}
11358 function, which conses up a Lisp string from the character arguments it
11359 receives. Here is the definition, pasted from @code{alloc.c}:
11360
11361 @example
11362 @group
11363 DEFUN ("string", Fstring, 0, MANY, 0, /*
11364 Concatenate all the argument characters and make the result a string.
11365 */
11366        (int nargs, Lisp_Object *args))
11367 @{
11368   Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN);
11369   Ibyte *p = storage;
11370
11371   for (; nargs; nargs--, args++)
11372     @{
11373       Lisp_Object lisp_char = *args;
11374       CHECK_CHAR_COERCE_INT (lisp_char);
11375       p += set_itext_ichar (p, XCHAR (lisp_char));
11376     @}
11377   return make_string (storage, p - storage);
11378 @}
11379 @end group
11380 @end example
11381
11382 Now we can analyze the source line by line.
11383
11384 The resulting string contains exactly as many characters as there are arguments to the
11385 function. This is why we allocate @code{MAX_ICHAR_LEN} * @var{nargs}
11386 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs}
11387 @code{Ichar}s to fit in the string.
11388
11389 Then, the loop checks that each element is a character, converting
11390 integers in the process. Like many other functions in XEmacs, this
11391 function silently accepts integers where characters are expected, for
11392 historical and compatibility reasons. In new code that does not need
11393 this coercion, @code{CHECK_CHAR} suffices. @code{XCHAR (lisp_char)}
11394 extracts the @code{Ichar} from the @code{Lisp_Object}, and
11395 @code{set_itext_ichar} stores it at @code{p}, advancing @code{p} by the
11396 number of bytes written. Finally, @code{make_string} builds the Lisp string from the @code{p - storage} bytes stored.
11397
11398 Other instructive examples of correct coding under Mule can be found all
11399 over the XEmacs code. For starters, I recommend
11400 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have
11401 understood this section of the manual and studied the examples, you can
11402 proceed to write new Mule-aware code.
11403
11404 @node Mule-izing Code, , An Example of Mule-Aware Code, Coding for Mule
11405 @subsection Mule-izing Code
11406
11407 A lot of code is written without Mule in mind, and needs to be made
11408 Mule-correct or "Mule-ized". There is really no substitute for
11409 line-by-line analysis when doing this, but the following checklist can
11410 help:
11411
11412 @itemize @bullet
11413 @item
11414 Check all uses of @code{XSTRING_DATA}.
11415 @item
11416 Check all uses of @code{build_string} and @code{make_string}.
11417 @item
11418 Check all uses of @code{tolower} and @code{toupper}.
11419 @item
11420 Check object print methods.
11421 @item
11422 Check for use of functions such as @code{write_c_string},
11423 @code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}.
11424 @item
11425 Check all occurrences of @code{char} and correct to one of the other
11426 typedefs described above.
11427 @item
11428 Check all existing uses of @code{TO_EXTERNAL_FORMAT},
11429 @code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for
11430 @samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}).
11431 @item
11432 In Windows code, string literals may need to be encapsulated with @code{XETEXT}.
11433 @end itemize
11434
11435 @node CCL, Microsoft Windows-Related Multilingual Issues, Coding for Mule, Multilingual Support
11436 @section CCL
11437 @cindex CCL
11438
11439 @example
11440 MACHINE CODE:
11441
11442 The machine code consists of a vector of 32-bit words.
11443 The first such word specifies the start of the EOF section of the code;
11444 this is the code executed to handle any stuff that needs to be done
11445 (e.g. designating back to ASCII and left-to-right mode) after all
11446 other encoded/decoded data has been written out. This is not used for
11447 charset CCL programs.
11448
11449 REGISTER: 0..7 -- referred by RRR or rrr
11450
11451 OPERATOR BIT FIELD (27-bit): XXXXXXXXXXXXXXX RRR TTTTT
11452         TTTTT (5-bit): operator type
11453         RRR (3-bit): register number
11454         XXXXXXXXXXXXXXX (15-bit):
11455                 CCCCCCCCCCCCCCC: constant or address
11456                 000000000000rrr: register number
11457
11458 AAAAA:  00000   +
11459         00001   -
11460         00010   *
11461         00011   /
11462         00100   %
11463         00101   &
11464         00110   |
11465         00111   ~
11466
11467         01000   <<
11468         01001   >>
11469         01010   <8
11470         01011   >8
11471         01100   //
11472         01101   not used
11473         01110   not used
11474         01111   not used
11475
11476         10000   <
11477         10001   >
11478         10010   ==
11479         10011   <=
11480         10100   >=
11481         10101   !=
11482
11483 OPERATORS:      TTTTT RRR XX..
11484
11485 SetCS:          00000 RRR C...C   RRR = C...C
11486 SetCL:          00001 RRR .....   RRR = c...c
11487                 c.............c
11488 SetR:           00010 RRR ..rrr   RRR = rrr
11489 SetA:           00011 RRR ..rrr   RRR = array[rrr]
11490                 C.............C   size of array = C...C
11491                 c.............c   contents = c...c
11492
11493 Jump:           00100 000 c...c   jump to c...c
11494 JumpCond:       00101 RRR c...c   if (!RRR) jump to c...c
11495 WriteJump:      00110 RRR c...c   Write1 RRR, jump to c...c
11496 WriteReadJump:  00111 RRR c...c   Write1, Read1 RRR, jump to c...c
11497 WriteCJump:     01000 000 c...c   Write1 C...C, jump to c...c
11498                 C...C
11499 WriteCReadJump: 01001 RRR c...c   Write1 C...C, Read1 RRR,
11500                 C.............C   and jump to c...c
11501 WriteSJump:     01010 000 c...c   WriteS, jump to c...c
11502                 C.............C
11503                 S.............S
11504                 ...
11505 WriteSReadJump: 01011 RRR c...c   WriteS, Read1 RRR, jump to c...c
11506                 C.............C
11507                 S.............S
11508                 ...
11509 WriteAReadJump: 01100 RRR c...c   WriteA, Read1 RRR, jump to c...c
11510                 C.............C   size of array = C...C
11511                 c.............c   contents = c...c
11512                 ...
11513 Branch:         01101 RRR C...C   if (RRR >= 0 && RRR < C..)
11514                 c.............c   branch to (RRR+1)th address
11515 Read1:          01110 RRR ...     read 1-byte to RRR
11516 Read2:          01111 RRR ..rrr   read 2-byte to RRR and rrr
11517 ReadBranch:     10000 RRR C...C   Read1 and Branch
11518                 c.............c
11519                 ...
11520 Write1:         10001 RRR .....   write 1-byte RRR
11521 Write2:         10010 RRR ..rrr   write 2-byte RRR and rrr
11522 WriteC:         10011 000 .....   write 1-char C...CC
11523                 C.............C
11524 WriteS:         10100 000 .....   write C..-byte of string
11525                 C.............C
11526                 S.............S
11527                 ...
11528 WriteA:         10101 RRR .....   write array[RRR]
11529                 C.............C   size of array = C...C
11530                 c.............c   contents = c...c
11531                 ...
11532 End:            10110 000 .....   terminate the execution
11533
11534 SetSelfCS:      10111 RRR C...C   RRR AAAAA= C...C
11535                 ..........AAAAA
11536 SetSelfCL:      11000 RRR .....   RRR AAAAA= c...c
11537                 c.............c
11538                 ..........AAAAA
11539 SetSelfR:       11001 RRR ..Rrr   RRR AAAAA= rrr
11540                 ..........AAAAA
11541 SetExprCL:      11010 RRR ..Rrr   RRR = rrr AAAAA c...c
11542                 c.............c
11543                 ..........AAAAA
11544 SetExprR:       11011 RRR ..rrr   RRR = rrr AAAAA Rrr
11545                 ............Rrr
11546                 ..........AAAAA
11547 JumpCondC:      11100 RRR c...c   if !(RRR AAAAA C..) jump to c...c
11548                 C.............C
11549                 ..........AAAAA
11550 JumpCondR:      11101 RRR c...c   if !(RRR AAAAA rrr) jump to c...c
11551                 ............rrr
11552                 ..........AAAAA
11553 ReadJumpCondC:  11110 RRR c...c   Read1 and JumpCondC
11554                 C.............C
11555                 ..........AAAAA
11556 ReadJumpCondR:  11111 RRR c...c   Read1 and JumpCondR
11557                 ............rrr
11558                 ..........AAAAA
11559 @end example
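To make the operator bit field concrete, here is a minimal sketch of packing and unpacking one instruction word. It assumes that the layout line @samp{XXXXXXXXXXXXXXX RRR TTTTT} is written with the most significant bits on the left, so that @samp{TTTTT} occupies the five least significant bits. The structure and function names are hypothetical; the actual interpreter may extract the fields differently.

```c
/* Fields of one CCL instruction word, assuming the layout
   XXXXXXXXXXXXXXX RRR TTTTT with TTTTT in the low-order bits. */
struct sketch_ccl_insn
{
  unsigned int op;              /* TTTTT: operator type, 5 bits */
  unsigned int reg;             /* RRR: register number, 3 bits */
  unsigned int arg;             /* X...X: constant, address or rrr, 15 bits */
};

/* Split an instruction word into its three fields. */
struct sketch_ccl_insn
sketch_ccl_decode (unsigned long word)
{
  struct sketch_ccl_insn insn;

  insn.op = (unsigned int) (word & 0x1F);
  insn.reg = (unsigned int) ((word >> 5) & 0x07);
  insn.arg = (unsigned int) ((word >> 8) & 0x7FFF);
  return insn;
}

/* Build an instruction word from its three fields. */
unsigned long
sketch_ccl_encode (unsigned int op, unsigned int reg, unsigned int arg)
{
  return ((unsigned long) (arg & 0x7FFF) << 8)
    | ((unsigned long) (reg & 0x07) << 5)
    | (unsigned long) (op & 0x1F);
}
```

Under these assumptions, @samp{SetCS} (operator @samp{00000}) with register 3 and constant 42 would be built as @code{sketch_ccl_encode (0, 3, 42)}, and @samp{Jump} (operator @samp{00100}) to address 7 as @code{sketch_ccl_encode (4, 0, 7)}.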
11560
11561 @node Microsoft Windows-Related Multilingual Issues, Modules for Internationalization, CCL, Multilingual Support
11562 @section Microsoft Windows-Related Multilingual Issues
11563 @cindex Microsoft Windows-related multilingual issues
11564 @cindex Windows-related multilingual issues
11565 @cindex multilingual issues, Windows-related
11566
11567 @menu
11568 * Microsoft Documentation::
11569 * Locales::
11570 * More about code pages::
11571 * More about locales::
11572 * Unicode support under Windows::
11573 * The golden rules of writing Unicode-safe code::
11574 * The format of the locale in setlocale()::
11575 * Random other Windows I18N docs::
11576 @end menu
11577
11578 @node Microsoft Documentation, Locales, Microsoft Windows-Related Multilingual Issues, Microsoft Windows-Related Multilingual Issues
11579 @subsection Microsoft Documentation
11580 @cindex Microsoft documentation
11581
11582 Documentation on international support in Windows is scattered throughout MSDN.
11583 Here are some good places to look:
11584
11585 @enumerate
11586 @item
11587 C Runtime (CRT) intl support
11588
11589 @enumerate
11590 @item
11591 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Internationalization
11592 @item
11593 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Global Constants -> Locale Categories
11594 @item
11595 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Appendixes -> Language and Country/Region Strings
11596 @item
11597 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Appendixes -> Generic-Text Mappings
11598 @item
11599 Function documentation for various functions:
11600 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Alphabetic Function Reference
11601 e.g. _setmbcp(), setlocale(), strcoll functions
11602 @end enumerate
11603
11604 @item
11605 Win32 API intl support
11606
11607 @enumerate
11608 @item
11609 Platform SDK Documentation -> Base Services -> International Features
11610 @item
11611 Platform SDK Documentation -> User Interface Services -> Windows User Interface -> User Input -> Keyboard Input -> Character Messages -> International Features
11612 @item
11613 Backgrounders -> Windows Platform -> Windows 2000 -> International Support in Microsoft Windows 2000
11614 @end enumerate
11615
11616 @item
11617 Microsoft Layer for Unicode
11618
11619 Platform SDK Documentation -> Windows API -> Windows 95/98/Me Programming -> Windows 95/98/Me Overviews -> Microsoft Layer for Unicode on Windows 95/98/Me Systems
11620
11621 @item
11622 Look in the CRT sources! They come with VC++. See win32.c.
11623 @end enumerate
11624
11625 @node Locales, More about code pages, Microsoft Documentation, Microsoft Windows-Related Multilingual Issues
11626 @subsection Locales, code pages, and other concepts of "language"
11627 @cindex locales, code pages, and other concepts of "language"
11628
11629 First, make sure you clearly understand the difference between the C
11630 runtime library (CRT) and the Win32 API! See win32.c.
11631
11632 There are various different ways of representing the vague concept
11633 of "language", and it can be very confusing. So:
11634
11635 @itemize @bullet
11636 @item
11637 The CRT library has the concept of "locale", which is a
11638 combination of language and country, and which controls the way
11639 currency and dates are displayed, the encoding of data, etc.
11640
11641 @item
11642 XEmacs has the concept of "language environment", more or less
11643 like a locale; although currently in most cases it just refers to
11644 the language, and no sub-language distinctions are
11645 made. (Exceptions are with Chinese, which has different language
11646 environments for Taiwan and mainland China, due to the different
11647 encodings and writing systems.)
11648
11649 @item
11650 Windows has a number of different language concepts:
11651
11652 @enumerate
11653 @item
11654 There are "languages" and "sublanguages", which correspond to
11655 the languages and countries of the C library -- e.g. LANG_ENGLISH
11656 and SUBLANG_ENGLISH_US. These are identified by small integers (10 and 6 bits wide),
11657 called the "primary language identifier" and "sublanguage
11658 identifier", respectively. These are combined into a 16-bit
11659 integer or "language identifier" by MAKELANGID().
11660
11661 @item
11662 The language identifier in turn is combined with a "sort
11663 identifier" (and optionally a "sort version") to yield a 32-bit
11664 integer called a "locale identifier" (type LCID), which identifies
11665 locales -- the primary means of distinguishing language/regional
11666 settings and similar to C library locales.
11667
11668 @item
11669 A "code page" combines the XEmacs concepts of "charset" and "coding
11670 system". It logically encompasses
11671
11672 @itemize @minus
11673 @item
11674 a set of supported characters
11675 @item
11676 an enumeration associating each character with a code point, which
11677 is a number or number pair; there may be disjoint ranges of numbers
11678 supported
11679 @item
11680 a way of encoding a series of characters into a string of bytes
11681 @end itemize
11682
11683 Note that the first two properties correspond to an XEmacs "charset"
11684 and the last to an XEmacs "coding system".
11685
11686 Traditional encodings are either simple one-byte encodings, or
11687 combination one-byte/two-byte encodings (aka MBCS encodings, where MBCS
11688 stands for "Multibyte Character Set") with the following properties:
11689
11690 @itemize @minus
11691 @item
11692 all characters are encoded as a one-byte or two-byte sequence
11693 @item
11694 the encoding is stateless (non-modal)
11695 @item
11696 the lower 128 bytes are compatible with ASCII
11697 @item
11698 in the higher bytes, the value of the first byte ("lead byte")
11699 determines whether a second byte follows
11700 @item
11701 the values used for second bytes may overlap those used for first
11702 bytes, and (in some encodings) include values in the low half; thus,
11703 moving backwards is hard, and pure-ASCII algorithms (e.g. finding the
11704 next slash) will fail unless rewritten to be MBCS-aware (neither of
11705 these problems exists in UTF-8 or in the XEmacs internal string
11706 encoding)
11707 @end itemize
11708
11709 Recent code pages, however, do not necessarily follow these properties --
11710 code pages have been expanded to include arbitrary encodings, such as
11711 UTF-8 (may have more than two bytes per character) and ISO-2022-JP
11712 (complex modal encoding).
11713
11714 @item
11715 Every Windows locale has four associated code pages: ANSI (an
11716 international standard or some Microsoft-created approximation; the
11717 native code page under Windows), OEM (a DOS encoding, still used in the
11718 FAT file system), Mac (an encoding used on the Macintosh) and EBCDIC (a
11719 non-ASCII-compatible encoding used on IBM mainframes, originally based
11720 on the BCD or "binary-coded decimal" encoding of numbers). All code
11721 pages associated with a locale follow (as far as I know) the properties
11722 listed above for traditional code pages. More than one locale can share
11723 a code page -- e.g. all the Western European languages, including
11724 English, do.
11725
11726 @item
11727 Windows also has an "input locale identifier" (aka "keyboard
11728 layout id") or HKL, which is a 32-bit integer composed of the
11729 16-bit language identifier and a 16-bit "device identifier", which
11730 originally specified a particular keyboard layout (e.g. the locale
11731 "US English" can have the QWERTY layout, the Dvorak layout, etc.),
11732 but has been expanded to include speech-to-text converters and
11733 other non-keyboard ways of inputting text. Note that both the HKL
11734 and LCID share the language identifier in the lower 16 bits, and in
11735 both cases a 0 in the upper 16 bits means "default" (sort order or
11736 device), providing a way to convert between HKLs, LCIDs, and
11737 language identifiers (i.e. language/sublanguage pairs). The
11738 default keyboard layout for a language is (as far as I can
11739 determine) established using the Regional Settings control panel
11740 applet, where you can add input locales as combinations of language
11741 (actually language/sublanguage) and layout; presumably if you list
11742 only one input locale with a particular language, the corresponding
11743 layout is the default for that language. But what if you list more
11744 than one? You can specify a single default input locale, but there
11745 appears to be no way to do so on a per-language basis.
11746 @end enumerate
11747 @end itemize
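The way these identifiers nest can be sketched in portable C. The functions below are hypothetical stand-ins that mirror the bit layout used by the Win32 macros @code{MAKELANGID}, @code{MAKELCID}, @code{PRIMARYLANGID} and @code{SUBLANGID} (primary language in the low 10 bits of the language identifier, sublanguage in the upper 6 bits), so that the example compiles without @file{windows.h}. The constant values are those of @code{LANG_ENGLISH}, @code{SUBLANG_ENGLISH_US} and @code{SORT_DEFAULT}. The HKL is modelled as a 32-bit integer, following the description above.

```c
#include <stdint.h>

#define SKETCH_LANG_ENGLISH        0x09
#define SKETCH_SUBLANG_ENGLISH_US  0x01
#define SKETCH_SORT_DEFAULT        0x0

/* Language identifier: sublanguage in the upper 6 bits, primary
   language in the lower 10 bits. */
uint16_t
sketch_make_langid (uint16_t primary, uint16_t sub)
{
  return (uint16_t) (((sub & 0x3F) << 10) | (primary & 0x3FF));
}

uint16_t
sketch_primary_langid (uint16_t langid)
{
  return (uint16_t) (langid & 0x3FF);
}

uint16_t
sketch_sub_langid (uint16_t langid)
{
  return (uint16_t) (langid >> 10);
}

/* Locale identifier: sort identifier above the 16-bit language
   identifier. */
uint32_t
sketch_make_lcid (uint16_t langid, uint16_t sort_id)
{
  return ((uint32_t) sort_id << 16) | langid;
}

/* An LCID and an HKL both carry the language identifier in their
   lower 16 bits, so the same extraction serves for either. */
uint16_t
sketch_langid_from_low_word (uint32_t lcid_or_hkl)
{
  return (uint16_t) (lcid_or_hkl & 0xFFFF);
}
```

With these definitions, US English yields the language identifier 0x0409, and combining it with the default sort order yields an LCID whose numeric value is the same 0x0409, which illustrates the point above that a 0 in the upper 16 bits means "default".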
11748
11749 @node More about code pages, More about locales, Locales, Microsoft Windows-Related Multilingual Issues
11750 @subsection More about code pages
11751 @cindex more about code pages
11752
11753 Here is what MSDN says about code pages (article "Code Pages"):
11754
11755 @quotation
11756 A code page is a character set, which can include numbers,
11757 punctuation marks, and other glyphs. Different languages and locales
11758 may use different code pages. For example, ANSI code page 1252 is
11759 used for American English and most European languages; OEM code page
11760 932 is used for Japanese Kanji.
11761
11762 A code page can be represented in a table as a mapping of characters
11763 to single-byte values or multibyte values. Many code pages share the
11764 ASCII character set for characters in the range 0x00--0x7F.
11765
11766 The Microsoft run-time library uses the following types of code pages:
11767
11768 -- System-default ANSI code page. By default, at startup the run-time
11769 system automatically sets the multibyte code page to the
11770 system-default ANSI code page, which is obtained from the operating
11771 system. The call
11772
11773 setlocale ( LC_ALL, "" );
11774
11775 also sets the locale to the system-default ANSI code page.
11776
11777 -- Locale code page. The behavior of a number of run-time routines is
11778 dependent on the current locale setting, which includes the locale
11779 code page. (For more information, see Locale-Dependent Routines.) By
11780 default, all locale-dependent routines in the Microsoft run-time
11781 library use the code page that corresponds to the "C" locale. At
11782 run-time you can change or query the locale code page in use with a
11783 call to setlocale.
11784
11785 -- Multibyte code page. The behavior of most of the multibyte-character
11786 routines in the run-time library depends on the current multibyte
11787 code page setting. By default, these routines use the system-default
11788 ANSI code page. At run-time you can query and change the multibyte
11789 code page with _getmbcp and _setmbcp, respectively.
11790
11791 -- The "C" locale is defined by ANSI to correspond to the locale in
11792 which C programs have traditionally executed. The code page for the
11793 "C" locale ("C" code page) corresponds to the ASCII character
11794 set. For example, in the "C" locale, islower returns true for the
11795 values 0x61--0x7A only. In another locale, islower may return true
11796 for these as well as other values, as defined by that locale.
11797
11798 Under "Locale-Dependent Routines" we notice the following setlocale
11799 dependencies:
11800
11801 atof, atoi, atol (LC_NUMERIC)
11802 is Routines (LC_CTYPE)
11803 isleadbyte (LC_CTYPE)
11804 localeconv (LC_MONETARY, LC_NUMERIC)
11805 MB_CUR_MAX (LC_CTYPE)
11806 _mbccpy (LC_CTYPE)
11807 _mbclen (LC_CTYPE)
11808 mblen (LC_CTYPE)
11809 _mbstrlen (LC_CTYPE)
11810 mbstowcs (LC_CTYPE)
11811 mbtowc (LC_CTYPE)
11812 printf (LC_NUMERIC, for radix character output)
11813 scanf (LC_NUMERIC, for radix character recognition)
11814 setlocale/_wsetlocale (Not applicable)
11815 strcoll (LC_COLLATE)
11816 _stricoll/_wcsicoll (LC_COLLATE)
11817 _strncoll/_wcsncoll (LC_COLLATE)
11818 _strnicoll/_wcsnicoll (LC_COLLATE)
11819 strftime, wcsftime (LC_TIME)
11820 _strlwr (LC_CTYPE)
11821 strtod/wcstod/strtol/wcstol/strtoul/wcstoul (LC_NUMERIC, for radix character recognition)
11822 _strupr (LC_CTYPE)
11823 strxfrm/wcsxfrm (LC_COLLATE)
11824 tolower/towlower (LC_CTYPE)
11825 toupper/towupper (LC_CTYPE)
11826 wcstombs (LC_CTYPE)
11827 wctomb (LC_CTYPE)
11828 _wtoi/_wtol (LC_NUMERIC)
11829 @end quotation
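The quoted claim about islower in the "C" locale can be checked with a small portable sketch. It uses only standard C (setlocale and islower); the function name count_lower_in_c_locale is made up for illustration and is not part of XEmacs or the CRT.

```c
#include <ctype.h>
#include <locale.h>

/* Count the byte values 0..255 that islower() accepts once the
   "C" locale has been selected.  Hypothetical helper, for illustration. */
int
count_lower_in_c_locale (void)
{
  int c, count = 0;

  setlocale (LC_ALL, "C");      /* the traditional, ASCII-only locale */
  for (c = 0; c < 256; c++)
    if (islower (c))
      count++;
  return count;
}
```

In the "C" locale only 0x61 through 0x7A qualify, so the count is 26. After setlocale (LC_ALL, "") or a named locale, additional values may qualify, depending on that locale's code page.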
11830
11831 NOTE: The above documentation doesn't clearly explain the "locale code
11832 page" and "multibyte code page". These are two different values,
11833 maintained respectively in the CRT global variables __lc_codepage and
11834 __mbcodepage. Calling e.g. setlocale (LC_ALL, "JAPANESE") sets @strong{ONLY}
11835 __lc_codepage to 932 (the code page for Japanese), and leaves
11836 __mbcodepage unchanged (usually 1252, i.e. Windows-ANSI). You'd have to
11837 call _setmbcp() to change __mbcodepage. Figuring out from the
11838 documentation which routines use which code page is not so obvious. But:
11839
11840 @itemize @bullet
11841 @item
11842 from "Interpretation of Multibyte-Character Sequences" it appears that
11843 all "multibyte-character routines" use the multibyte code page except for
11844 mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), and wctomb().
11845
11846 @item
11847 from "_setmbcp": "The multibyte code page also affects
11848 multibyte-character processing by the following run-time library
11849 routines: _exec functions _mktemp _stat _fullpath _spawn functions
11850 _tempnam _makepath _splitpath tmpnam. In addition, all run-time library
11851 routines that receive multibyte-character argv or envp program arguments
11852 as parameters (such as the _exec and _spawn families) process these
11853 strings according to the multibyte code page. Hence these routines are
11854 also affected by a call to _setmbcp that changes the multibyte code
11855 page."
11856 @end itemize
11857
11858 Summary: from looking at the CRT source (which comes with VC++) and
11859 carefully looking through the docs, it appears that:
11860
11861 @itemize @bullet
11862 @item
11863 the "locale code page" is used by all of the routines listed above
11864 under "Locale-Dependent Routines" (EXCEPT _mbccpy() and _mbclen()),
11865 as well as any other place that converts between multibyte and Unicode
11866 strings, e.g. the startup code.
11867 @item
11868 the "multibyte code page" is used in all of the *mb*() routines
11869 except mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(),
11870 and wctomb(); also _exec*(), _spawn*(), _mktemp(), _stat(), _fullpath(),
11871 _tempnam(), _makepath(), _splitpath(), tmpnam(), and similar functions
11872 without the leading underscore.
11873 @end itemize
11874
11875 @node More about locales, Unicode support under Windows, More about code pages, Microsoft Windows-Related Multilingual Issues
11876 @subsection More about locales
11877 @cindex more about locales
11878
11879 In addition to the locale defined by the CRT, Windows (i.e. the Win32 API)
11880 defines various locales:
11881
11882 @itemize @bullet
11883 @item
11884 The system-default locale is the locale defined under "Language
11885 settings for the system" in the "Regional Options" control panel. This
11886 is NOT user-specific, and changing it requires a reboot (at least under
11887 Windows 2000). The ANSI code page of the system-default locale is
11888 returned by GetACP(), and you can specify this code page in calls
11889 e.g. to MultiByteToWideChar with the constant CP_ACP.
11890
11891 @item
11892 The user-default locale is the locale defined under "Settings for the
11893 current user" in the "Regional Options" control panel.
11894
11895 @item
11896 There is a thread-local locale set by SetThreadLocale. #### What is this
11897 used for?
11898 @end itemize
11899
11900 The Win32 API has a bunch of multibyte functions -- all of those that
11901 end with ...A(), and on which we spend so much effort in
11902 intl-encap-win32.c. These appear to ALWAYS use the ANSI code page of
11903 the system-default locale (GetACP(), CP_ACP). Note that this applies
11904 also, for example, to the encoding of filenames in all file-handling
11905 routines, including the CRT ones such as open(), because they pass their
11906 args unchanged to the Win32 API.
11907
11908 @node Unicode support under Windows, The golden rules of writing Unicode-safe code, More about locales, Microsoft Windows-Related Multilingual Issues
11909 @subsection Unicode support under Windows
11910 @cindex unicode support under windows
11911
11912 Basically, the whole concept of locales and code pages is broken, because
11913 it is extremely messy to support and does not allow for documents that use
11914 multiple languages simultaneously. Unicode was designed in response to
11915 this, the idea being to create a single character set that could be used to
11916 encode all the world's languages. Windows has supported Unicode since the
11917 beginning of the Win32 API. Internally, every code page has an associated
11918 table to convert the characters of that code page to and from Unicode, and
11919 the Win32 API itself probably (perhaps always) uses Unicode internally.
11920
11921 Under Windows there are two different versions of all library routines that
11922 accept or return text, those that handle Unicode text and those handling
11923 "multibyte" text, i.e. variable-width ASCII-compatible text in some
11924 national format such as EUC or Shift-JIS. Because Windows 95 basically
11925 doesn't support Unicode but Windows NT does, and Microsoft doesn't provide
11926 any way of writing a single binary that will work on both systems and still
11927 use Unicode when it's available (although see below, Microsoft Layer for
11928 Unicode), we need to provide a way of run-time conditionalizing so you
11929 could have one binary for both systems. "Unicode-splitting" refers to
11930 writing code that will handle this properly. This means using
11931 Qmswindows_tstr as the external conversion format, calling the appropriate
11932 qxe...() Unicode-split version of library functions, and doing other things
11933 in certain cases, e.g. when a qxe() function is not present.
11934
11935 Unicode support also requires that the various Windows API's be
11936 "Unicode-encapsulated", so that they automatically call the ANSI or
11937 Unicode version of the API call appropriately and handle the size
11938 differences in structures. What this means is:
11939
11940 @itemize @bullet
11941 @item
11942 first, note that Windows already provides a sort of encapsulation
11943 of all API's that deal with text. All such API's are underlyingly
11944 provided in two versions, with an A or W suffix (ANSI or "wide"
11945 i.e. Unicode), and the compile-time constant UNICODE controls which is
11946 selected by the unsuffixed API. Same thing happens with structures, and
11947 also with types, where the generic types have names beginning with T --
11948 TCHAR, LPTSTR, etc. Unfortunately, this is compile-time only, not
11949 run-time, so not sufficient. (Creating the necessary run-time encoding
11950 is not conceptually difficult, but very time-consuming to write. It
11951 adds no significant overhead, and the only reason it's not standard in
11952 Windows is conscious marketing attempts by Microsoft to cripple Windows
11953 95. FUCK MICROSOFT! They even describe in a KnowledgeBase article
11954 exactly how to create such an API [although we don't exactly follow
11955 their procedure], and point out its usefulness; the procedure is also
11956 described more generally in Nadine Kano's book on Win32
11957 internationalization -- written SIX YEARS AGO! Obviously Microsoft has
11958 such an API available internally.)
11959
11960 @item
11961 what we do is provide an encapsulation of each standard Windows API call
11962 that is split into A and W versions. current theory is to avoid all
11963 preprocessor games; so we name the function with a prefix -- "qxe"
11964 currently -- and require callers to use the prefixed name. Callers need
11965 to explicitly use the W version of all structures, and convert text
11966 themselves using Qmswindows_tstr. the qxe encapsulated version will
11967 automatically call the appropriate A or W version depending on whether
11968 we're running on 9x or NT (you can force use of the A calls on NT,
11969 e.g. for testing purposes, using the command-line switch -nuni aka
11970 -no-unicode-lib-calls), and copy data between W and A versions of the
11971 structures as necessary.
11972
11973 @item
11974 We require the caller to handle the actual translation of text to
11975 avoid possible overflow when dealing with fixed-size Windows
11976 structures. There are no such problems when copying data between
11977 the A and W versions because ANSI text is never larger than its
11978 equivalent Unicode representation.
11979 @end itemize
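The run-time dispatch described above can be sketched roughly as follows. This is a portable illustration only, not the code in intl-encap-win32.c: DemoLengthA, DemoLengthW, qxeDemoLength, demo_unicode_p and Extbyte_demo are all hypothetical stand-ins, and standard wchar_t takes the place of the Windows 16-bit WCHAR.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

typedef char Extbyte_demo;      /* stand-in for the Extbyte type */

/* Decided once at startup in a real program (e.g. NT vs. 9x);
   hypothetical flag for this sketch. */
int demo_unicode_p = 1;

/* Stand-ins for an A/W pair of system calls. */
static size_t DemoLengthA (const char *s) { return strlen (s); }
static size_t DemoLengthW (const wchar_t *s) { return wcslen (s); }

/* Encapsulated entry point: the caller has already converted the text
   to the external format matching demo_unicode_p, and passes it as an
   opaque byte pointer.  The wrapper only selects the A or W call. */
size_t
qxeDemoLength (const Extbyte_demo *text)
{
  if (demo_unicode_p)
    return DemoLengthW ((const wchar_t *) text);
  else
    return DemoLengthA ((const char *) text);
}
```

The point of the design is that the choice between the two variants is made by a run-time test inside the wrapper rather than by the preprocessor, so one binary can serve both kinds of system.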
11980
11981 NOTE NOTE NOTE: As of August 2001, Microsoft (finally! See my nasty
11982 comment above) released their own Unicode-encapsulation library, called
11983 Microsoft Layer for Unicode on Windows 95/98/Me Systems. It tries to be
11984 more transparent than we are, in that
11985
11986 @itemize @bullet
11987 @item
11988 its routines do ANSI/Unicode string translation, while we don't, for
11989 efficiency (we already have to do internal/external conversion so it's
11990 no extra burden to do the proper conversion directly rather than always
11991 converting to Unicode and then doing a second conversion to ANSI as
11992 necessary)
11993
11994 @item
11995 rather than requiring separately-named routines (qxeFooBar), they
11996 physically override the existing routines at the link level. it also
11997 appears that they do this BADLY, in that if you link with the MLU, you
11998 get an application that runs ONLY on Win9x!!! (hint -- use
11999 GetProcAddress()). there's still no way to create a single binary!
12000 fucking losers.
12001
12002 @item
12003 they assume you compile with UNICODE defined, so there's no need for the
12004 application to explicitly use ...W structures, as we require.
12005
12006 @item
12007 they also intercept windows procedures to deal with notify messages as
12008 necessary, which we don't do yet.
12009
12010 @item
12011 they (of course) don't use Extbyte.
12012 @end itemize
12013
12014 at some point (especially when they fix the single-binary problem!), we
12015 should consider switching. for the meantime, we'll stick with what i've
12016 already written. perhaps we should think about adopting some of the
12017 greater transparency they have; but i opted against transparency on
12018 purpose, to make the code easier to follow for someone who's not familiar
12019 with it. until our library is really complete and bug-free, we should
12020 think twice before doing this.
12021
12022 According to Microsoft documentation, only the following functions are
12023 provided under Windows 9x to support Unicode (see MSDN page "Windows
12024 95/98/Me General Limitations"):
12025
12026 EnumResourceLanguages
12027 EnumResourceNames
12028 EnumResourceTypes
12029 ExtTextOut
12030 FindResource
12031 FindResourceEx
12032 GetCharWidth
12033 GetCommandLine
12034 GetTextExtentPoint
12035 GetTextExtentPoint32
12036 lstrcat
12037 lstrcpy
12038 lstrlen
12039 MessageBox
12040 MessageBoxEx
12041 MultiByteToWideChar
12042 TextOut
12043 WideCharToMultiByte
12044
12045 also maybe GetTextExtentExPoint? (KB Q125671 "Unicode Functions Supported
12046 by Windows 95")
12047
12048 However, the C runtime library provides some additional support (according
12049 to the CRT sources, as the docs are not very clear on this):
12050
12051 @itemize @bullet
12052 @item
12053 wmain() is completely supported, and appropriate Unicode-formatted argv
12054 and envp will always be passed.
12055 @item
12056 Likewise, wWinMain() is completely supported. (NOTE: The docs are not at
12057 all clear on how these various entry points interact, and imply that
12058 a windows-subsystem program "must" use WinMain(), while a console-
12059 subsystem program "must" use main(), and a program compiled with UNICODE
12060 (which we don't do, see above) "must" use the w*() versions, while a program
12061 not compiled this way "must" use the plain versions. In fact it appears
12062 that the CRT provides four different compiler entry points, namely
12063 w?(main|WinMain)CRTStartup, and we simply choose the one we like using
12064 the appropriate link flag.)
12065 @item
12066 _wenviron, _wputenv
12067 @end itemize
12068
12069 NOTE:
12070
12071 @itemize @bullet
12072 @item
12073 wsetargv.obj uses routines that were buggily left out of MSVCRT; anyway,
12074 from looking at the source, it does NOT work correctly under Win 9x, as
12075 it blindly calls the Unicode version of Unicode-split API's such as
12076 FindFirstFile.
12077
12078 @item
12079 the w*() file routines are @strong{NOT} supported -- or at least, they blindly
12080 call the ...W() versions of the Win32 API calls.
12081 @end itemize
12082
12083 @node The golden rules of writing Unicode-safe code, The format of the locale in setlocale(), Unicode support under Windows, Microsoft Windows-Related Multilingual Issues
12084 @subsection The golden rules of writing Unicode-safe code
12085 @cindex the golden rules of writing unicode-safe code
12086
12087 @itemize @bullet
12088 @item
12089 There are no preprocessor games going on.
12090
12091 @item
12092 Do not set the UNICODE constant.
12093
12094 @item
12095 You need to change your code to call the Windows API prefixed with "qxe"
12096 functions (when they exist) and use the ...W structs instead of the
12097 generic ones. String arguments in the qxe functions are of type Extbyte
12098 *.
12099
12100 @item
12101 Your code is responsible for conversion of text arguments. We try to
12102 handle everything else -- the argument differences, the copying back and
12103 forth of structures, etc. Use Qmswindows_tstr and macros such as
12104 C_STRING_TO_TSTR. You are also responsible for interpreting and
12105 specifying string sizes, which have not been changed. Usually these are
12106 in characters, meaning you need to divide by XETCHAR_SIZE. (But, some
12107 functions want sizes in bytes, even with Unicode strings. Look in the
12108 documentation.) Use XETEXT when specifying string constants, so that
12109 they show up in Unicode as necessary.
12110
12111 @item
12112 If you need to process external strings (in general you should not do
12113 this; do all your manipulations in internal format and convert at the
12114 point of entry into or exit from the function), use the xet...()
12115 functions.
12116
12117 @item
12118 If you have to declare a fixed array to hold a string coming from
12119 Windows (and hence either multibyte or Unicode), declare it of type
12120 Extbyte[] and multiply the size by MAX_XETCHAR_SIZE.
12121 @end itemize
12122
12123 @node The format of the locale in setlocale(), Random other Windows I18N docs, The golden rules of writing Unicode-safe code, Microsoft Windows-Related Multilingual Issues
12124 @subsection The format of the locale in setlocale()
12125 @cindex the format of the locale in setlocale()
12126
12127 It appears that under Unix the standard format for the string in
12128 setlocale() involves two-letter language and country abbreviations, e.g.
12129 ja or ja_jp or ja_jp.euc for Japanese. Windows (MSDN article "Language
12130 Strings" in the run-time reference appendix, see doc list above) speaks
12131 of "(primary) language" and "sublanguage" (usually a country, but in the
12132 case of Chinese the sublanguage is "simplified" or "traditional"). It
12133 is highly flexible in what it takes, and thankfully it canonicalizes the
12134 result to a unique form "Language_Country.Encoding". It allows (note
12135 that all specifications can be in any case):
12136
12137 @itemize @bullet
12138 @item
12139 the full "language_country.encoding" specification or just
12140 "language_country", in which case the default encoding will be chosen.
12141
12142 @item
12143 a three-letter acronym, consisting of the ISO-standard two-letter
12144 language abbreviation followed by a third letter indicating the
12145 sublanguage.
12146
12147 @item
12148 just a language name, e.g. "dutch", standing for the combination of
12149 the language with "default" as sublanguage, referring to the default
12150 (often "prototypical") country for that language (in this case the
12151 Netherlands). You can abbreviate the name by removing any number of
12152 letters from the end. Ambiguity is not a problem: Even specifying
12153 just a single letter is valid providing any language starting with
12154 that letter exists, but the result may not be what you want (e.g. "c"
12155 maps to "catalan", not "chinese", "czech", etc.). The way of
12156 resolving ambiguity appears fairly random -- it's not alphabetical
12157 ("a" maps to "arabic" not "albanian").
12158
12159 @item
12160 a combination of language and sublanguage separated by a hyphen,
12161 e.g. "dutch-belgian"; note that the sublanguage designator in this
12162 case is NOT necessarily the same as the country, e.g. "belgian" vs.
12163 "belgium". "dutch-belgium" (or even "dutch-belg") does @strong{NOT} get you
12164 the right result, but returns "Dutch_Netherlands.1252" instead! This
12165 is because, although you may not abbreviate the result, Windows
12166 accepts any unknown value in the sublanguage field and treats it as
12167 equivalent to "default". Note also that if the sublanguage name
12168 has underscores in it, you need to change them to spaces, e.g.
12169 "spanish-dominican republic".
12170
12171 @item
12172 sometimes, just a sublanguage name, e.g. "belgian", standing for
12173 the combination of one of the languages spoken in that region and
12174 the sublanguage of the region -- in this case Dutch. Note that
12175 there is no guarantee of "prototypicality" in this case in choice of
12176 language! You could hardly say that Dutch (aka Flemish) is more
12177 prototypical of Belgium than French. You cannot abbreviate this
12178 form, if it's allowed at all.
12179 @end itemize
12180
12181 In addition:
12182
12183 @itemize @bullet
12184 @item
12185 note further that you are not limited to the language/sublanguage
12186 combinations predefined by Windows. You can set weird combinations
12187 like "Chinese_Kenya.1255" (Chinese spoken in Kenya, represented by
12188 Windows-1255, i.e. Hebrew!) and Windows doesn't complain, despite the
12189 language-encoding inconsistency. You can also make up a weird
12190 combination and leave out the encoding, e.g. "Chinese_Qatar", which
12191 maps to "Chinese_Qatar.1256", where Windows-1256 is Arabic -- i.e. it
12192 appears to be choosing the encoding based on a default for the
12193 country.
12194
12195 @item
12196 note also that the names for countries are often not what you expect.
12197 "urdu_pakistan" fails, and just "urdu" shows why, as it maps to
12198 "Urdu_Islamic Republic of Pakistan.1256". That is, some countries
12199 exist in their full name, and the canonicalized form with underscore
12200 is not very forgiving in its handling of country specifications.
12201 Similarly, Uzbekistan is "Republic of Uzbekistan", and "China" is
12202 "People's Republic of China" -- but in this latter case, unlike the
12203 other two, just "China" works as an alias, e.g. "uzbek_china" maps
12204 to "Uzbek_People's Republic of China.936".
12205
12206 @item
12207 note that just the two-letter ISO language code is NOT allowed.
12208 Sometimes you'll get lucky (e.g. "fr" does map to "france"), but
12209 sometimes you'll get no match (e.g. "pl"), and sometimes you'll get
12210 really unlucky in that the call will succeed but with the wrong
12211 language (e.g. "es" maps to "estonian", not "spanish").
12212 @end itemize
12213
12214 As an example, MSDN article "Language Strings" indicates that German
12215 (default) can be specified using "deu" or "german"; German (Austrian)
12216 with "dea" or "german-austrian"; German (Swiss) with "des",
12217 "german-swiss", or "swiss"; French (Swiss) with "french-swiss" or "frs";
12218 and English (USA) with "american", "american english",
12219 "american-english", "english-american", "english-us", "english-usa",
12220 "enu", "us", or "usa". This is not, of course, an exhaustive list even
12221 for just the given locales -- just "english" works in practice because
12222 English (Default) maps to English (USA). (#### Is this always the case?)
12223
12224 Given the canonicalization, we don't have to worry too much about the
12225 different kinds of inputs to setlocale() -- unlike for Unix, where no
12226 canonicalization is usually performed, the particular locales that
12227 exist vary tremendously from OS to OS, and we need to parse the
12228 uncanonicalized locale spec, directly from the user, to figure out the
12229 encoding to use, making various guesses if not enough information is
12230 present. Yuck! The tricky thing under Windows is figuring out how to
12231 deal with the sublang. It appears that the trick of simply passing the
12232 text of the manifest constant itself of the sublang, with appropriate
12233 hacking (e.g. of underscore to space), works most of the time.
12234
12235 @node Random other Windows I18N docs, , The format of the locale in setlocale(), Microsoft Windows-Related Multilingual Issues
12236 @subsection Random other Windows I18N docs
12237 @cindex random other windows i18n docs
12238
12239 Introduction to Internationalization Issues in the Win32 API
12240
12241 Abstract: This page provides an overview of the aspects of the Win32
12242 internationalization API that are relevant to XEmacs, including the
12243 basic distinction between multibyte and Unicode encodings. Also
12244 included are pointers to how XEmacs should make use of this API.
12245
12246 The Win32 API is quite well-designed in its handling of strings
12247 encoded for various character sets. The API is geared around the idea
12248 that two different methods of encoding strings should be
12249 supported. These methods are called multibyte and Unicode,
12250 respectively. The multibyte encoding is compatible with ASCII strings
12251 and is a more efficient representation when dealing with strings
12252 containing primarily ASCII characters, but it has a great number of
12253 serious deficiencies and limitations, including that it is very
12254 difficult and error-prone to work with strings in this encoding, and
12255 any particular string in a multibyte encoding can only contain
12256 characters from a very limited number of character sets. The Unicode
12257 encoding rectifies all of these deficiencies, but it is not compatible
12258 with ASCII strings (in other words, an existing program will not be
12259 able to handle the encoded strings unless it is explicitly modified to
12260 do so), and it takes up twice as much memory space as multibyte
12261 encodings when encoding a purely ASCII string.
12262
12263 Multibyte encodings use a variable number of bytes (either one or two)
12264 to represent characters. ASCII characters are also represented by a
12265 single byte with its high bit not set, and non-ASCII characters are
12266 represented by one or two bytes, the first of which always has its
12267 high bit set. (The second byte, when it exists, may or may not have
12268 its high bit set.) There is no single multibyte encoding. Instead,
12269 there is generally one encoding per non-ASCII character set. Such an
12270 encoding is capable of representing (besides ASCII characters, of
12271 course) only characters from one (or possibly two) particular
12272 character sets.
12273
12274 Multibyte encoding makes processing of strings very difficult. For
12275 example, given a pointer to the beginning of a character within a
12276 string, finding the pointer to the beginning of the previous character
12277 may require backing up all the way to the beginning of the string, and
12278 then moving forward. Also, an operation such as separating out the
12279 components of a path by searching for backslashes will fail if it's
12280 implemented in the simplest (but not multibyte-aware) fashion, because
12281 it may find what appears to be a backslash, but which is actually the
12282 second byte of a two-byte character. Also, the limited number of
12283 character sets that any particular multibyte encoding can represent
12284 means that loss of data is likely if a string is converted from the
12285 XEmacs internal format into a multibyte format.
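The backslash problem mentioned above can be made concrete with Shift-JIS, where the two-byte character 0x95 0x5C has a trail byte equal to ASCII backslash. The sketch below is for illustration only; the helper names are hypothetical, and XEmacs itself avoids the issue by working in the internal encoding rather than scanning external text like this.

```c
#include <stddef.h>
#include <string.h>

/* Shift-JIS lead bytes lie in the ranges 0x81-0x9F and 0xE0-0xFC. */
static int
sjis_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return a pointer to the last genuine backslash in the Shift-JIS
   string S, or NULL if there is none.  Unlike strrchr(), this skips
   the trail byte of every two-byte character. */
const char *
sjis_last_backslash (const char *s)
{
  const char *last = NULL;

  while (*s)
    {
      if (sjis_lead_byte_p ((unsigned char) *s) && s[1] != '\0')
        s += 2;                 /* skip both bytes of a two-byte character */
      else
        {
          if (*s == '\\')
            last = s;
          s++;
        }
    }
  return last;
}
```

For the byte sequence 'C' ':' '\\' 0x95 0x5C 'x', strrchr() reports the 0x5C at offset 4, which is half of a character, while the lead-byte-aware scan reports the real separator at offset 2. Note that the scan has to start from the beginning of the string, which is exactly the cost described above.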
12286
12287 For these reasons, the C code in XEmacs should never do any sort of
12288 work with multibyte encoded strings (or with strings in any external
12289 encoding for that matter). Strings should always be maintained in the
12290 internal encoding, which is predictable, and converted to an external
12291 encoding only at the point where the string moves from the XEmacs C
12292 code and enters a system library function. Similarly, when a string is
12293 returned from a system library function, it should be immediately
12294 converted into the internal coding before any operations are done on
12295 it.
12296
12297 Unicode, unlike multibyte encodings, is a fixed-width encoding where
12298 every character is represented using 16 bits. It is also capable of
12299 encoding all the characters from all the character sets in common use
12300 in the world. The predictability and completeness of the Unicode
12301 encoding makes it a very good encoding for strings that may contain
12302 characters from many character sets mixed up with each other. At the
12303 same time, of course, it is incompatible with routines that expect
12304 ASCII characters and also incompatible with general string
12305 manipulation routines, which will encounter a great number of what
12306 would appear to be embedded nulls in the string. It also takes twice
12307 as much room to encode strings containing primarily ASCII
12308 characters. This is why XEmacs does not use Unicode or similar
12309 encoding internally for buffers.
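The "embedded nulls" problem is easy to demonstrate. The sketch below lays out "AB" by hand as 16-bit little-endian units, so it does not depend on the platform's wchar_t; the names ab_utf16le and utf16_units are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* "AB" as 16-bit little-endian code units, then a 16-bit terminator. */
const unsigned char ab_utf16le[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };

/* Length in 16-bit units, scanning for a 16-bit zero terminator. */
size_t
utf16_units (const unsigned char *p)
{
  size_t n = 0;

  while (p[0] != 0 || p[1] != 0)
    {
      p += 2;
      n++;
    }
  return n;
}
```

A byte-oriented routine such as strlen() stops at the zero high byte of 'A' and reports a length of 1, whereas the 16-bit scan reports 2 units. The same two characters also occupy four bytes instead of two, which is the doubling in size mentioned above.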
12310
12311 The Win32 API cleverly deals with the issue of 8 bit vs. 16 bit
12312 characters by declaring a type called TCHAR which specifies a generic
12313 character, either 8 bits or 16 bits. Generally TCHAR is defined to be
12314 the same as the simple C type char, unless the preprocessor constant
12315 UNICODE is defined, in which case TCHAR is defined to be WCHAR, which
12316 is a 16 bit type. Nearly all functions in the Win32 API that take
12317 strings are defined to take strings that are actually arrays of
12318 TCHARs. There is a type LPTSTR which is defined to be a string of
12319 TCHARs and another type LPCTSTR which is a const string of TCHARs. The
12320 theory is that any program that uses TCHARs exclusively to represent
12321 characters and does not make assumptions about the size of a TCHAR or
12322 the way that the characters are encoded should work transparently
12323 regardless of whether the UNICODE preprocessor constant is defined,
12324 which is to say, regardless of whether 8 bit multibyte or 16 bit
12325 Unicode characters are being used. The way that this is actually
12326 implemented is that every Win32 API function that takes a string as an
12327 argument actually maps to one of two functions which are suffixed with
12328 an A (which stands for ANSI, and means multibyte strings) or W (which
12329 stands for wide, and means Unicode strings). The mapping is, of
12330 course, controlled by the same UNICODE preprocessor
12331 constant. Generally all structures containing strings in them actually
12332 map to one of two different kinds of structures, with either an A or a
12333 W suffix after the structure name.
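The compile-time mechanism just described can be imitated in portable C roughly as follows. All DEMO_ and demo_ names are hypothetical stand-ins for TCHAR, TEXT(), UNICODE and the generic string routines, chosen so as not to collide with the real Windows headers; also note that wchar_t is not necessarily 16 bits outside Windows.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* One preprocessor constant selects the character type, the form of
   string literals, and which variant of each routine is called. */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
# define DEMO_TEXT(s) L##s
# define demo_tcslen wcslen
#else
typedef char DEMO_TCHAR;
# define DEMO_TEXT(s) s
# define demo_tcslen strlen
#endif

/* Code written only in terms of the generic names compiles either way. */
size_t
demo_class_name_length (void)
{
  const DEMO_TCHAR *name = DEMO_TEXT ("XEmacsDemo");

  return demo_tcslen (name);
}
```

Because the selection happens in the preprocessor, a given object file is committed to one variant, which is why this scheme alone cannot yield a single binary that adapts at run time.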
12334
12335 Unfortunately, not all of the implementations of the Win32 API
12336 implement all of the functionality described above. In particular,
12337 Windows 95 does not implement very much Unicode functionality. It does
12338 implement functions to convert multibyte-encoded strings to and from
12339 Unicode strings, and provides Unicode versions of certain low-level
12340 functions like ExtTextOut(). In fact, all of the rest of the Unicode
12341 versions of API functions are just stubs that return an
12342 error. Conversely, all versions of Windows NT completely implement all
12343 the Unicode functionality, but some versions (especially versions
12344 before Windows NT 4.0) don't implement much of the multibyte
12345 functionality. For this reason, as well as for general code
12346 cleanliness, XEmacs needs to be written in such a way that it works
12347 with or without the UNICODE preprocessor constant being defined.
12348
12349 Getting XEmacs to run when all strings are Unicode primarily involves
12350 removing any assumptions made about the size of characters. Remember
12351 what I said earlier about how the point of conversion between
12352 internally and externally encoded strings should occur at the point of
12353 entry or exit into or out of a library function. With this in mind, an
12354 externally encoded string in XEmacs can be treated simply as an
12355 arbitrary sequence of bytes of some length which has no particular
12356 relationship to the length of the string in the internal encoding.
12357
12358 Use Qnative for Unix conversion, Qmswindows_tstr for Windows ...
12359
12360 String constants that are to be passed directly to Win32 API functions,
12361 such as the names of window classes, need to be bracketed in their
12362 definition with a call to the macro XETEXT. This appropriately makes a
12363 string of either regular or wide chars, which is to say this string may be
12364 prepended with an L (causing it to be a wide string) depending on
12365 XEUNICODE_P.
12366
12367 @node Modules for Internationalization, , Microsoft Windows-Related Multilingual Issues, Multilingual Support
12368 @section Modules for Internationalization
12369 @cindex modules for internationalization
12370 @cindex internationalization, modules for
12371
12372 @example
12373 @file{mule-canna.c}
12374 @file{mule-ccl.c}
12375 @file{mule-charset.c}
12376 @file{mule-charset.h}
12377 @file{file-coding.c}
12378 @file{file-coding.h}
12379 @file{mule-coding.c}
12380 @file{mule-mcpath.c}
12381 @file{mule-mcpath.h}
12382 @file{mule-wnnfns.c}
12383 @file{mule.c}
12384 @end example
12385
12386 These files implement the MULE (Asian-language) support. Note that MULE
12387 actually provides a general interface for all sorts of languages, not
12388 just Asian languages (although they are generally the most complicated
12389 to support). This code is still in beta.
12390
12391 @file{mule-charset.*} and @file{file-coding.*} provide the heart of the
12392 XEmacs MULE support. @file{mule-charset.*} implements the @dfn{charset}
12393 Lisp object type, which encapsulates a character set (an ordered one- or
12394 two-dimensional set of characters, such as US ASCII or JISX0208 Japanese
12395 Kanji).
12396
12397 @file{file-coding.*} implements the @dfn{coding-system} Lisp object
12398 type, which encapsulates a method of converting between different
12399 encodings. An encoding is a representation of a stream of characters,
12400 possibly from multiple character sets, using a stream of bytes or words,
12401 and defines (e.g.) which escape sequences are used to specify particular
12402 character sets, how the indices for a character are converted into bytes
12403 (sometimes this involves setting the high bit; sometimes complicated
12404 rearranging of the values takes place, as in the Shift-JIS encoding),
12405 etc. It also contains some generic coding system implementations, such
12406 as the binary (no-conversion) coding system and a sample gzip coding system.
12407
12408 @file{mule-coding.c} contains the implementations of text coding systems.
12409
12410 @file{mule-ccl.c} provides the CCL (Code Conversion Language)
12411 interpreter. CCL is similar in spirit to Lisp byte code and is used to
12412 implement converters for custom encodings.
12413
12414 @file{mule-canna.c} and @file{mule-wnnfns.c} implement interfaces to
12415 external programs used to implement the Canna and WNN input methods,
12416 respectively. This is currently in beta.
12417
12418 @file{mule-mcpath.c} provides some functions to allow for pathnames
12419 containing extended characters. This code is fragmentary, obsolete, and
12420 completely non-working. Instead, @code{pathname-coding-system} is used
12421 to specify conversions of names of files and directories. The standard
12422 C I/O functions like @samp{open()} are wrapped so that conversion occurs
12423 automatically.
12424
12425 @file{mule.c} contains a few miscellaneous things. It currently seems
12426 to be unused and probably should be removed.
12427
12428
12429
12430 @example
12431 @file{intl.c}
12432 @end example
12433
12434 This provides some miscellaneous internationalization code for
12435 implementing message translation and interfacing to the Ximp input
12436 method. None of this code is currently working.
12437
12438
12439
12440 @example
12441 @file{iso-wide.h}
12442 @end example
12443
12444 This contains leftover code from an earlier implementation of
12445 Asian-language support, and is not currently used.
12446
12447
12448 @node Consoles; Devices; Frames; Windows, The Redisplay Mechanism, Multilingual Support, Top
12449 @chapter Consoles; Devices; Frames; Windows
12450 @cindex consoles; devices; frames; windows
12451 @cindex devices; frames; windows, consoles;
12452 @cindex frames; windows, consoles; devices;
12453 @cindex windows, consoles; devices; frames;
12454
12455 @menu
12456 * Introduction to Consoles; Devices; Frames; Windows::
12457 * Point::
12458 * Window Hierarchy::
12459 * The Window Object::
12460 * Modules for the Basic Displayable Lisp Objects::
12461 @end menu
12462
12463 @node Introduction to Consoles; Devices; Frames; Windows, Point, Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows
12464 @section Introduction to Consoles; Devices; Frames; Windows
12465 @cindex consoles; devices; frames; windows, introduction to
12466 @cindex devices; frames; windows, introduction to consoles;
12467 @cindex frames; windows, introduction to consoles; devices;
12468 @cindex windows, introduction to consoles; devices; frames;
12469
12470 A window-system window that you see on the screen is called a
12471 @dfn{frame} in Emacs terminology. Each frame is subdivided into one or
12472 more non-overlapping panes, called (confusingly) @dfn{windows}. Each
12473 window displays the text of a buffer in it. (See above on Buffers.) Note
12474 that buffers and windows are independent entities: Two or more windows
12475 can be displaying the same buffer (potentially in different locations),
12476 and a buffer can be displayed in no windows.
12477
12478 A single display screen that contains one or more frames is called
12479 a @dfn{display}. Under most circumstances, there is only one display.
12480 However, more than one display can exist, for example if you have
12481 a @dfn{multi-headed} console, i.e. one with a single keyboard but
12482 multiple displays. (Typically in such a situation, the various
12483 displays act like one large display, in that the mouse is only
12484 in one of them at a time, and moving the mouse off of one moves
12485 it into another.) In some cases, the different displays will
12486 have different characteristics, e.g. one color and one mono.
12487
12488 XEmacs can display frames on multiple displays. It can even deal
12489 simultaneously with frames on multiple keyboards (called @dfn{consoles} in
12490 XEmacs terminology). Here is one case where this might be useful: You
12491 are using XEmacs on your workstation at work, and leave it running.
12492 Then you go home and dial in on a TTY line, and you can use the
12493 already-running XEmacs process to display another frame on your local
12494 TTY.
12495
12496 Thus, there is a hierarchy console -> display -> frame -> window.
12497 There is a separate Lisp object type for each of these four concepts (a display is represented by a @dfn{device} object).
12498 Furthermore, there is logically a @dfn{selected console},
12499 @dfn{selected display}, @dfn{selected frame}, and @dfn{selected window}.
12500 Each of these objects is distinguished in various ways, such as being the
12501 default object for various functions that act on objects of that type.
12502 Note that every containing object remembers the ``selected'' object
12503 among the objects that it contains: e.g. not only is there a selected
12504 window, but every frame remembers the last window in it that was
12505 selected, and changing the selected frame causes the remembered window
12506 within it to become the selected window. Similar relationships apply
12507 for consoles to devices and devices to frames.
12508
12509 @node Point, Window Hierarchy, Introduction to Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows
12510 @section Point
12511 @cindex point
12512
12513 Recall that every buffer has a current insertion position, called
12514 @dfn{point}. Now, two or more windows may be displaying the same buffer,
12515 and the text cursor in the two windows (i.e. @code{point}) can be in
12516 two different places. You may ask, how can that be, since each
12517 buffer has only one value of @code{point}? The answer is that each window
12518 also has a value of @code{point} that is squirreled away in it. There
12519 is only one selected window, and the value of ``point'' in that buffer
12520 corresponds to that window. When the selected window is changed
12521 from one window to another displaying the same buffer, the old
12522 value of @code{point} is stored into the old window's ``point'' and the
12523 value of @code{point} from the new window is retrieved and made the
12524 value of @code{point} in the buffer. This means that @code{window-point}
12525 for the selected window is potentially inaccurate, and if you
12526 want to retrieve the correct value of @code{point} for a window,
12527 you must special-case on the selected window and retrieve the
12528 buffer's point instead. This is related to why @code{save-window-excursion}
12529 does not save the selected window's value of @code{point}.
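
The swapping of point between a buffer and its windows can be sketched
in C as follows. This is a simplified model under stated assumptions:
the structures contain only the fields needed here, plain integers stand
in for markers, and the function names are illustrative rather than
copies of the code in @file{window.c}.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real structures. */
struct buffer { int point; };
struct window { struct buffer *buffer; int pointm; };

static struct window *selected_window;

/* Make W the selected window.  The buffer's point is stashed in the
   outgoing window, and W's stashed point becomes the buffer's point. */
void
select_window_sketch (struct window *w)
{
  assert (w != NULL && w->buffer != NULL);
  if (selected_window != NULL)
    selected_window->pointm = selected_window->buffer->point;
  selected_window = w;
  w->buffer->point = w->pointm;
}

/* Return the correct value of point for W.  The stashed value in the
   selected window may be stale, so that case reads the buffer. */
int
window_point_sketch (const struct window *w)
{
  return w == selected_window ? w->buffer->point : w->pointm;
}
```

Note that moving point in the buffer while a window is selected leaves
that window's stashed value untouched until another window is selected,
which is why the accessor special-cases the selected window.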
12530
12531 @node Window Hierarchy, The Window Object, Point, Consoles; Devices; Frames; Windows
12532 @section Window Hierarchy
12533 @cindex window hierarchy
12534 @cindex hierarchy of windows
12535
12536 If a frame contains multiple windows (panes), they are always created
12537 by splitting an existing window along the horizontal or vertical axis.
12538 Terminology is a bit confusing here: to @dfn{split a window
12539 horizontally} means to create two side-by-side windows, i.e. to make a
12540 @emph{vertical} cut in a window. Likewise, to @dfn{split a window
12541 vertically} means to create two windows, one above the other, by making
12542 a @emph{horizontal} cut.
12543
12544 If you split a window and then split again along the same axis, you
12545 will end up with a number of panes all arranged along the same axis.
12546 The precise way in which the splits were made should not be important,
12547 and this is reflected internally. Internally, all windows are arranged
12548 in a tree, consisting of two types of windows, @dfn{combination} windows
12549 (which have children, and are covered completely by those children) and
12550 @dfn{leaf} windows, which have no children and are visible. Every
12551 combination window has two or more children, all arranged along the same
12552 axis. There are (logically) two subtypes of combination windows, depending on
12553 whether their children are horizontally or vertically arrayed. There is
12554 always one root window, which is either a leaf window (if the frame
12555 contains only one window) or a combination window (if the frame contains
12556 more than one window). In the latter case, the root window will have
12557 two or more children, either horizontally or vertically arrayed, and
12558 each of those children will be either a leaf window or another
12559 combination window.
12560
12561 Here are some rules:
12562
12563 @enumerate
12564 @item
12565 Horizontal combination windows can never have children that are
12566 horizontal combination windows; same for vertical.
12567
12568 @item
12569 Only leaf windows can be split (obviously) and this splitting does one
12570 of two things: (a) turns the leaf window into a combination window and
12571 creates two new leaf children, or (b) turns the leaf window into one of
12572 the two new leaves and creates the other leaf. Rule (1) dictates which
12573 of these two outcomes happens.
12574
12575 @item
12576 Every combination window must have at least two children.
12577
12578 @item
12579 Leaf windows can never become combination windows. They can be deleted,
12580 however. If this results in a violation of (3), the parent combination
12581 window also gets deleted.
12582
12583 @item
12584 All functions that accept windows must be prepared to accept combination
12585 windows, and do something sane (e.g. signal an error if so).
12586 Combination windows @emph{do} escape to the Lisp level.
12587
12588 @item
12589 All windows have three fields governing their contents:
12590 these are @dfn{hchild} (a list of horizontally-arrayed children),
12591 @dfn{vchild} (a list of vertically-arrayed children), and @dfn{buffer}
12592 (the buffer contained in a leaf window). Exactly one of
12593 these will be non-@code{nil}. Remember that @dfn{horizontally-arrayed}
12594 means ``side-by-side'' and @dfn{vertically-arrayed} means
12595 ``one above the other''.
12596
12597 @item
12598 Leaf windows also have markers in their @code{start} (the
12599 first buffer position displayed in the window) and @code{pointm}
12600 (the window's stashed value of @code{point}---see above) fields,
12601 while combination windows have @code{nil} in these fields.
12602
12603 @item
12604 The list of children for a window is threaded through the
12605 @code{next} and @code{prev} fields of each child window.
12606
12607 @item
12608 @strong{Deleted windows can be undeleted}. This happens as a result of
12609 restoring a window configuration, and is unlike frames, displays, and
12610 consoles, which, once deleted, can never be restored. Deleting a window
12611 does nothing except set a special @code{dead} bit to 1 and clear out the
12612 @code{next}, @code{prev}, @code{hchild}, and @code{vchild} fields, for
12613 GC purposes.
12614
12615 @item
12616 Most frames actually have two top-level windows---one for the
12617 minibuffer and one (the @dfn{root}) for everything else. The modeline
12618 (if present) separates these two. The @code{next} field of the root
12619 points to the minibuffer, and the @code{prev} field of the minibuffer
12620 points to the root. The other @code{next} and @code{prev} fields are
12621 @code{nil}, and the frame points to both of these windows.
12622 Minibuffer-less frames have no minibuffer window, and the @code{next}
12623 and @code{prev} of the root window are @code{nil}. Minibuffer-only
12624 frames have no root window, and the @code{next} of the minibuffer window
12625 is @code{nil} but the @code{prev} points to itself. (#### This is an
12626 artifact that should be fixed.)
12627 @end enumerate
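
Some of the rules above can be expressed as a consistency check over the
window tree. The following C sketch checks rules (1), (3) and (6) only.
It assumes a cut-down window structure in which @code{hchild} and
@code{vchild} point to the first child of the list; the structure layout
and the function name are illustrative, not the definitions in
@file{window.h}.

```c
#include <assert.h>
#include <stddef.h>

struct buffer;                  /* opaque in this sketch */

struct window
{
  struct window *hchild;        /* first side-by-side child, or NULL */
  struct window *vchild;        /* first stacked child, or NULL */
  struct buffer *buffer;        /* buffer of a leaf window, or NULL */
  struct window *next, *prev, *parent;
};

/* Return 1 if the subtree rooted at W obeys rules (1), (3) and (6),
   otherwise 0. */
int
window_tree_valid_p (const struct window *w)
{
  const struct window *child;
  int nonnull, count = 0;

  assert (w != NULL);

  /* Rule 6: exactly one of hchild, vchild, buffer is set. */
  nonnull = (w->hchild != NULL) + (w->vchild != NULL) + (w->buffer != NULL);
  if (nonnull != 1)
    return 0;
  if (w->buffer != NULL)
    return 1;                   /* a leaf has nothing more to check */

  child = w->hchild != NULL ? w->hchild : w->vchild;
  for (; child != NULL; child = child->next)
    {
      /* Rule 1: no combination window directly inside another
         combination window of the same direction. */
      if (w->hchild != NULL && child->hchild != NULL)
        return 0;
      if (w->vchild != NULL && child->vchild != NULL)
        return 0;
      if (!window_tree_valid_p (child))
        return 0;
      count++;
    }

  /* Rule 3: a combination window has at least two children. */
  return count >= 2;
}
```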
12628
12629 @node The Window Object, Modules for the Basic Displayable Lisp Objects, Window Hierarchy, Consoles; Devices; Frames; Windows
12630 @section The Window Object
12631 @cindex window object, the
12632 @cindex object, the window
12633
12634 Windows have the following accessible fields:
12635
12636 @table @code
12637 @item frame
12638 The frame that this window is on.
12639
12640 @item mini_p
12641 Non-@code{nil} if this window is a minibuffer window.
12642
12643 @item buffer
12644 The buffer that the window is displaying. This may change often during
12645 the life of the window.
12646
12647 @item dedicated
12648 Non-@code{nil} if this window is dedicated to its buffer.
12649
12650 @item pointm
12651 @cindex window point internals
12652 This is the value of point in the current buffer when this window is
12653 selected; when it is not selected, it retains its previous value.
12654
12655 @item start
12656 The position in the buffer that is the first character to be displayed
12657 in the window.
12658
12659 @item force_start
12660 If this flag is non-@code{nil}, it says that the window has been
12661 scrolled explicitly by the Lisp program. This affects what the next
12662 redisplay does if point is off the screen: instead of scrolling the
12663 window to show the text around point, it moves point to a location that
12664 is on the screen.
12665
12666 @item last_modified
12667 The @code{modified} field of the window's buffer, as of the last time
12668 a redisplay completed in this window.
12669
12670 @item last_point
12671 The buffer's value of point, as of the last time
12672 a redisplay completed in this window.
12673
12674 @item left
12675 This is the left-hand edge of the window, measured in columns. (The
12676 leftmost column on the screen is @w{column 0}.)
12677
12678 @item top
12679 This is the top edge of the window, measured in lines. (The top line on
12680 the screen is @w{line 0}.)
12681
12682 @item height
12683 The height of the window, measured in lines.
12684
12685 @item width
12686 The width of the window, measured in columns.
12687
12688 @item next
12689 This is the window that is the next in the chain of siblings. It is
12690 @code{nil} in a window that is the rightmost or bottommost of a group of
12691 siblings.
12692
12693 @item prev
12694 This is the window that is the previous in the chain of siblings. It is
12695 @code{nil} in a window that is the leftmost or topmost of a group of
12696 siblings.
12697
12698 @item parent
12699 Internally, XEmacs arranges windows in a tree; each group of siblings has
12700 a parent window whose area includes all the siblings. This field points
12701 to a window's parent.
12702
12703 Parent windows do not display buffers, and play little role in display
12704 except to shape their child windows. Emacs Lisp programs usually have
12705 no access to the parent windows; they operate on the windows at the
12706 leaves of the tree, which actually display buffers.
12707
12708 @item hscroll
12709 This is the number of columns that the display in the window is scrolled
12710 horizontally to the left. Normally, this is 0.
12711
12712 @item use_time
12713 This is the last time that the window was selected. The function
12714 @code{get-lru-window} uses this field.
12715
12716 @item display_table
12717 The window's display table, or @code{nil} if none is specified for it.
12718
12719 @item update_mode_line
12720 Non-@code{nil} means this window's mode line needs to be updated.
12721
12722 @item base_line_number
12723 The line number of a certain position in the buffer, or @code{nil}.
12724 This is used for displaying the line number of point in the mode line.
12725
12726 @item base_line_pos
12727 The position in the buffer for which the line number is known, or
12728 @code{nil} meaning none is known.
12729
12730 @item region_showing
12731 If the region (or part of it) is highlighted in this window, this field
12732 holds the mark position that made one end of that region. Otherwise,
12733 this field is @code{nil}.
12734 @end table
12735
12736 @node Modules for the Basic Displayable Lisp Objects, , The Window Object, Consoles; Devices; Frames; Windows
12737 @section Modules for the Basic Displayable Lisp Objects
12738 @cindex modules for the basic displayable Lisp objects
12739 @cindex displayable Lisp objects, modules for the basic
12740 @cindex Lisp objects, modules for the basic displayable
12741 @cindex objects, modules for the basic displayable Lisp
12742
12743 @example
12744 @file{console-msw.c}
12745 @file{console-msw.h}
12746 @file{console-stream.c}
12747 @file{console-stream.h}
12748 @file{console-tty.c}
12749 @file{console-tty.h}
12750 @file{console-x.c}
12751 @file{console-x.h}
12752 @file{console.c}
12753 @file{console.h}
12754 @end example
12755
12756 These modules implement the @dfn{console} Lisp object type. A console
12757 contains multiple display devices, but only one keyboard and mouse.
12758 Most of the time, a console will contain exactly one device.
12759
12760 Consoles are the top of a Lisp object inclusion hierarchy. Consoles
12761 contain devices, which contain frames, which contain windows.
12762
12763
12764
12765 @example
12766 @file{device-msw.c}
12767 @file{device-tty.c}
12768 @file{device-x.c}
12769 @file{device.c}
12770 @file{device.h}
12771 @end example
12772
12773 These modules implement the @dfn{device} Lisp object type. This
12774 abstracts a particular screen or connection on which frames are
12775 displayed. As with Lisp objects, event interfaces, and other
12776 subsystems, the device code is separated into a generic component, which
12777 presents a standardized interface (in the form of a set of methods), and
12778 components specific to particular device types.
12779
12780 The device subsystem defines all the methods and provides method
12781 services for not only device operations but also for the frame, window,
12782 menubar, scrollbar, toolbar, and other displayable-object subsystems.
12783 The reason for this is that all of these subsystems have the same
12784 subtypes (X, TTY, NeXTstep, Microsoft Windows, etc.) as devices do.
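
The general shape of such a method table can be sketched in C as a
structure of function pointers that the generic code dispatches through.
Everything named below is hypothetical (the structure, the single
method, and the two tables); the sketch only illustrates the pattern and
does not reproduce the actual definitions used by the device code.

```c
#include <assert.h>
#include <stddef.h>

struct frame;

/* Hypothetical method table: one slot per device-type-specific
   operation.  A NULL slot means the type has no special behavior. */
struct methods_sketch
{
  const char *name;
  int (*frame_visible_p) (const struct frame *f);
};

struct frame
{
  const struct methods_sketch *methods;
  int mapped;
};

/* Device-type-specific implementation for an imagined X type. */
static int
x_frame_visible_p (const struct frame *f)
{
  return f->mapped;
}

static const struct methods_sketch x_methods = { "x", x_frame_visible_p };
static const struct methods_sketch tty_methods = { "tty", NULL };

/* Generic entry point: dispatch to the method if one is defined,
   otherwise fall back to a default answer. */
int
frame_visible_p_sketch (const struct frame *f)
{
  assert (f != NULL && f->methods != NULL);
  if (f->methods->frame_visible_p == NULL)
    return 1;
  return f->methods->frame_visible_p (f);
}
```

Because frames, windows, scrollbars and so on all share the same set of
subtypes, a single table per type can carry the methods for all of them.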
12785
12786
12787
12788 @example
12789 @file{frame-msw.c}
12790 @file{frame-tty.c}
12791 @file{frame-x.c}
12792 @file{frame.c}
12793 @file{frame.h}
12794 @end example
12795
12796 Each device contains one or more frames in which objects (e.g. text) are
12797 displayed. A frame corresponds to a window in the window system;
12798 usually this is a top-level window but it could potentially be one of a
12799 number of overlapping child windows within a top-level window, using the
12800 MDI (Multiple Document Interface) protocol in Microsoft Windows or a
12801 similar scheme.
12802
12803 The @file{frame-*} files implement the @dfn{frame} Lisp object type and
12804 provide the generic and device-type-specific operations on frames
12805 (e.g. raising, lowering, resizing, moving, etc.).
12806
12807
12808
12809 @example
12810 @file{window.c}
12811 @file{window.h}
12812 @end example
12813
12814 @cindex window (in Emacs)
12815 @cindex pane
12816 Each frame consists of one or more non-overlapping @dfn{windows} (better
12817 known as @dfn{panes} in standard window-system terminology) in which a
12818 buffer's text can be displayed. Windows can also have scrollbars
12819 displayed around their edges.
12820
12821 @file{window.c} and @file{window.h} implement the @dfn{window} Lisp
12822 object type and provide code to manage windows. Since windows have no
12823 associated resources in the window system (the window system knows only
12824 about the frame; no child windows or anything are used for XEmacs
12825 windows), there is no device-type-specific code here; all of that code
12826 is part of the redisplay mechanism or the code for particular object
12827 types such as scrollbars.
12828
12829 @node The Redisplay Mechanism, Extents, Consoles; Devices; Frames; Windows, Top
12830 @chapter The Redisplay Mechanism
12831 @cindex redisplay mechanism, the
12832
12833 The redisplay mechanism is one of the most complicated sections of
12834 XEmacs, especially from a conceptual standpoint. This is doubly so
12835 because, unlike for the basic aspects of the Lisp interpreter, the
12836 computer science theories of how to efficiently handle redisplay are not
12837 well-developed.
12838
12839 When working with the redisplay mechanism, remember the Golden Rules
12840 of Redisplay:
12841
12842 @enumerate
12843 @item
12844 It Is Better To Be Correct Than Fast.
12845 @item
12846 Thou Shalt Not Run Elisp From Within Redisplay.
12847 @item
12848 It Is Better To Be Fast Than Not To Be.
12849 @end enumerate
12850
12851 @menu
12852 * Critical Redisplay Sections::
12853 * Line Start Cache::
12854 * Redisplay Piece by Piece::
12855 * Modules for the Redisplay Mechanism::
12856 * Modules for other Display-Related Lisp Objects::
12857 @end menu
12858
12859 @node Critical Redisplay Sections, Line Start Cache, The Redisplay Mechanism, The Redisplay Mechanism
12860 @section Critical Redisplay Sections
12861 @cindex redisplay sections, critical
12862 @cindex critical redisplay sections
12863
12864 Within this section, we are defenseless and assume that the
12865 following cannot happen:
12866
12867 @enumerate
12868 @item
12869 garbage collection
12870 @item
12871 Lisp code evaluation
12872 @item
12873 frame size changes
12874 @end enumerate
12875
12876 We ensure (3) by calling @code{hold_frame_size_changes()}, which
12877 will cause any pending frame size changes to get put on hold
12878 till after the end of the critical section. (1) follows
12879 automatically if (2) is met. #### Unfortunately, there are
12880 some places where Lisp code can be called within this section.
12881 We need to remove them.
12882
12883 If @code{Fsignal()} is called during this critical section, we
12884 will @code{abort()}.
12885
12886 If garbage collection is called during this critical section,
12887 we simply return. #### We should abort instead.
12888
12889 #### If a frame-size change does occur we should probably
12890 actually be preempting redisplay.
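
The idea of putting frame size changes on hold can be sketched in C as
follows. This is a minimal model, assuming a single global hold flag and
one pending size per frame; the function names carry a @code{_sketch}
suffix to make clear that they are not the real
@code{hold_frame_size_changes()} and its relatives.

```c
#include <assert.h>

struct frame
{
  int width, height;
  int size_change_pending;
  int pending_width, pending_height;
};

static int size_changes_held;

/* Enter the critical section: later size changes are only recorded. */
void
hold_size_changes_sketch (void)
{
  size_changes_held = 1;
}

/* Request a new size for F.  While changes are held, remember the
   request instead of applying it. */
void
change_frame_size_sketch (struct frame *f, int width, int height)
{
  assert (width > 0 && height > 0);
  if (size_changes_held)
    {
      f->size_change_pending = 1;
      f->pending_width = width;
      f->pending_height = height;
      return;
    }
  f->width = width;
  f->height = height;
  f->size_change_pending = 0;
}

/* Leave the critical section and apply any request recorded for F. */
void
unhold_size_changes_sketch (struct frame *f)
{
  size_changes_held = 0;
  if (f->size_change_pending)
    change_frame_size_sketch (f, f->pending_width, f->pending_height);
}
```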
12891
12892 @node Line Start Cache, Redisplay Piece by Piece, Critical Redisplay Sections, The Redisplay Mechanism
12893 @section Line Start Cache
12894 @cindex line start cache
12895
12896 The traditional scrolling code in Emacs breaks in a variable height
12897 world. It depends on the key assumption that the number of lines that
12898 can be displayed at any given time is fixed. This led to a complete
12899 separation of the scrolling code from the redisplay code. In order to
12900 fully support variable height lines, the scrolling code must actually be
12901 tightly integrated with redisplay. Only redisplay can determine how
12902 many lines will be displayed on a screen for any given starting point.
12903
12904 What is ideally wanted is a complete list of the starting buffer
12905 position for every possible display line of a buffer along with the
12906 height of that display line. Maintaining such a full list would be very
12907 expensive. We settle for having it include information for all areas
12908 which we happen to generate anyhow (i.e. the region currently being
12909 displayed) and for those areas we need to work with.
12910
12911 In order to ensure that the cache accurately represents what redisplay
12912 would actually show, it is necessary to invalidate it in many
12913 situations. If the buffer changes, the starting positions may no longer
12914 be correct. If a face or an extent has changed then the line heights
12915 may have altered. These events happen frequently enough that the cache
12916 can end up being constantly disabled. With this potentially constant
12917 invalidation, when is the cache ever useful?
12918
12919 Even if the cache is invalidated before every single usage, it is
12920 necessary. Scrolling often requires knowledge about display lines which
12921 are actually above or below the visible region. The cache provides a
12922 convenient light-weight method of storing this information for multiple
12923 display regions. This knowledge is necessary for the scrolling code to
12924 always obey the First Golden Rule of Redisplay.
12925
12926 If the cache already contains all of the information that the scrolling
12927 routines happen to need so that it doesn't have to go generate it, then
12928 we are able to obey the Third Golden Rule of Redisplay. The first thing
12929 we do to help out the cache is to always add the displayed region. This
12930 region had to be generated anyway, so the cache ends up getting the
12931 information basically for free. In those cases where a user is simply
12932 scrolling around viewing a buffer there is a high probability that this
12933 is sufficient to always provide the needed information. The second
12934 thing we can do is be smart about invalidating the cache.
12935
12936 TODO---Be smart about invalidating the cache. Potential places:
12937
12938 @itemize @bullet
12939 @item
12940 Insertions at end-of-line which don't cause line-wraps do not alter the
12941 starting positions of any display lines. These types of buffer
12942 modifications should not invalidate the cache. This is actually a large
12943 optimization for redisplay speed as well.
12944 @item
12945 Buffer modifications frequently only affect the display of lines at and
12946 below where they occur. In these situations we should only invalidate
12947 the part of the cache starting at where the modification occurs.
12948 @end itemize
12949
12950 In case you're wondering, the Second Golden Rule of Redisplay is not
12951 applicable.
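
To make the discussion concrete, here is a small C sketch of a cache of
display-line starts with heights, including the partial invalidation
proposed in the TODO list. It is an assumption-laden model: the entry
layout, the fixed-size array, and all function names are invented for
illustration and do not match the real line start cache.

```c
#include <assert.h>

#define CACHE_MAX 64

/* One cached display line: the buffer positions it covers and its
   height in pixels. */
struct line_start_entry { int start; int end; int height; };

struct line_start_cache
{
  struct line_start_entry lines[CACHE_MAX];
  int count;
};

/* Append a display line.  Lines must be added in buffer order.
   Return 1 on success, 0 if the line was rejected. */
int
cache_add_line (struct line_start_cache *c, int start, int end, int height)
{
  assert (start <= end && height > 0);
  if (c->count == CACHE_MAX)
    return 0;
  if (c->count > 0 && c->lines[c->count - 1].end >= start)
    return 0;
  c->lines[c->count].start = start;
  c->lines[c->count].end = end;
  c->lines[c->count].height = height;
  c->count++;
  return 1;
}

/* Partial invalidation: a change at POS keeps every line that ends
   before POS and drops the line containing POS and all later ones. */
void
cache_invalidate_from (struct line_start_cache *c, int pos)
{
  int i;
  for (i = 0; i < c->count; i++)
    if (c->lines[i].end >= pos)
      break;
  c->count = i;
}

/* How many cached lines, beginning with index FIRST, fit completely
   in PIXELS of vertical space?  With variable-height lines this
   cannot be computed from a fixed line count. */
int
cache_lines_that_fit (const struct line_start_cache *c, int first, int pixels)
{
  int n = 0;
  while (first + n < c->count && c->lines[first + n].height <= pixels)
    {
      pixels -= c->lines[first + n].height;
      n++;
    }
  return n;
}
```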
12952
12953 @node Redisplay Piece by Piece, Modules for the Redisplay Mechanism, Line Start Cache, The Redisplay Mechanism
12954 @section Redisplay Piece by Piece
12955 @cindex redisplay piece by piece
12956
12957 As you can begin to see, redisplay is complex and also not well
12958 documented. Chuck no longer works on XEmacs, so this section is my take
12959 on the workings of redisplay.
12960
12961 Redisplay happens in three phases:
12962
12963 @enumerate
12964 @item
12965 Determine desired display in area that needs redisplay.
12966 Implemented by @file{redisplay.c}.
12967 @item
12968 Compare desired display with current display.
12969 Implemented by @file{redisplay-output.c}.
12970 @item
12971 Output changes. Implemented by @file{redisplay-output.c},
12972 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c}.
12973 @end enumerate
12974
12975 Steps 1 and 2 are device-independent and relatively complex. Step 3 is
12976 mostly device-dependent.
12977
12978 Determining the desired display
12979
12980 Display attributes are stored in @code{display_line} structures. Each
12981 @code{display_line} consists of a set of @code{display_block}'s and each
12982 @code{display_block} contains a number of @code{rune}'s. Generally,
12983 each window holds dynarr's of @code{display_line}'s representing
12984 the current display and the desired display.
12985
12986 The @code{display_line} structures are tightly tied to buffers, which
12987 presents a problem for redisplay as this connection is bogus for the
12988 modeline. Hence the @code{display_line} generation routines are
12989 duplicated for generating the modeline. This means that the modeline
12990 display code has many bugs that the standard redisplay code does not.
12991
12992 The guts of @code{display_line} generation are in
12993 @code{create_text_block}, which creates a single display line for the
12994 desired locale. This incrementally parses the characters on the current
12995 line and generates redisplay structures for each.
12996
12997 Gutter redisplay is different. Because the data to display is stored in
12998 a string we cannot use @code{create_text_block}. Instead we use
12999 @code{create_text_string_block} which performs the same function as
13000 @code{create_text_block} but for strings. Many of the complexities of
13001 @code{create_text_block} to do with cursor handling and selective
13002 display have been removed.
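
The second and third phases (compare desired with current, then output
only what changed) can be sketched in C as below. This is a deliberate
simplification: each display line is reduced to a plain string instead
of blocks and runes, and the structure and function names are invented
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COLS 80

/* Stand-in for a display line: just the text to show on that row. */
struct display_line_sketch { char text[MAX_COLS + 1]; };

/* Compare DESIRED with CURRENT row by row.  For each row that
   differs, call OUTPUT_LINE (if non-NULL) and copy the desired line
   into CURRENT.  Return the number of rows that were output. */
int
update_lines_sketch (struct display_line_sketch *current,
                     const struct display_line_sketch *desired,
                     int nlines,
                     void (*output_line) (int row, const char *text))
{
  int row, changed = 0;

  assert (nlines >= 0);
  for (row = 0; row < nlines; row++)
    if (strcmp (current[row].text, desired[row].text) != 0)
      {
        if (output_line != NULL)
          output_line (row, desired[row].text);
        current[row] = desired[row];
        changed++;
      }
  return changed;
}
```

In this model the callback plays the role of the device-dependent output
step, while the comparison loop is device-independent.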
13003
13004 @node Modules for the Redisplay Mechanism, Modules for other Display-Related Lisp Objects, Redisplay Piece by Piece, The Redisplay Mechanism
13005 @section Modules for the Redisplay Mechanism
13006 @cindex modules for the redisplay mechanism
13007 @cindex redisplay mechanism, modules for the
13008
13009 @example
13010 @file{redisplay-output.c}
13011 @file{redisplay-msw.c}
13012 @file{redisplay-tty.c}
13013 @file{redisplay-x.c}
13014 @file{redisplay.c}
13015 @file{redisplay.h}
13016 @end example
13017
13018 These files provide the redisplay mechanism. As with many other
13019 subsystems in XEmacs, there is a clean separation between the general
13020 and device-specific support.
13021
13022 @file{redisplay.c} contains the bulk of the redisplay engine. These
13023 functions update the redisplay structures (which describe how the screen
13024 is to appear) to reflect any changes made to the state of any
13025 displayable objects (buffer, frame, window, etc.) since the last time
13026 that redisplay was called. These functions are highly optimized to
13027 avoid doing more work than necessary (since redisplay is called
13028 extremely often and is potentially a huge time sink), and depend heavily
13029 on notifications from the objects themselves that changes have occurred,
13030 so that redisplay doesn't explicitly have to check each possible object.
13031 The redisplay mechanism also contains a great deal of caching to further
13032 speed things up; some of this caching is contained within the various
13033 displayable objects.
13034
13035 @file{redisplay-output.c} goes through the redisplay structures and converts
13036 them into calls to device-specific methods to actually output the screen
13037 changes.
13038
13039 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c}
13040 are implementations of these redisplay output methods, for X, Microsoft
13041 Windows and TTY frames, respectively.
13042
13043
13044
13045 @example
13046 @file{indent.c}
13047 @end example
13048
13049 This module contains various functions and Lisp primitives for
13050 converting between buffer positions and screen positions. These
13051 functions call the redisplay mechanism to do most of the work, and then
13052 examine the redisplay structures to get the necessary information. This
13053 module needs work.
13054
13055
13056
13057 @example
13058 @file{termcap.c}
13059 @file{terminfo.c}
13060 @file{tparam.c}
13061 @end example
13062
13063 These files contain functions for working with the termcap (BSD-style)
13064 and terminfo (System V style) databases of terminal capabilities and
13065 escape sequences, used when XEmacs is displaying in a TTY.
13066
13067
13068
13069 @example
13070 @file{cm.c}
13071 @file{cm.h}
13072 @end example
13073
13074 These files provide some miscellaneous TTY-output functions and should
13075 probably be merged into @file{redisplay-tty.c}.
13076
13077
13078
13079 @node Modules for other Display-Related Lisp Objects, , Modules for the Redisplay Mechanism, The Redisplay Mechanism
13080 @section Modules for other Display-Related Lisp Objects
13081 @cindex modules for other display-related Lisp objects
13082 @cindex display-related Lisp objects, modules for other
13083 @cindex Lisp objects, modules for other display-related
13084
13085 @example
13086 @file{faces.c}
13087 @file{faces.h}
13088 @end example
13089
13090
13091
13092 @example
13093 @file{bitmaps.h}
13094 @file{glyphs-eimage.c}
13095 @file{glyphs-msw.c}
13096 @file{glyphs-msw.h}
13097 @file{glyphs-widget.c}
13098 @file{glyphs-x.c}
13099 @file{glyphs-x.h}
13100 @file{glyphs.c}
13101 @file{glyphs.h}
13102 @end example
13103
13104
13105
13106 @example
13107 @file{objects-msw.c}
13108 @file{objects-msw.h}
13109 @file{objects-tty.c}
13110 @file{objects-tty.h}
13111 @file{objects-x.c}
13112 @file{objects-x.h}
13113 @file{objects.c}
13114 @file{objects.h}
13115 @end example
13116
13117
13118
13119 @example
13120 @file{menubar-msw.c}
13121 @file{menubar-msw.h}
13122 @file{menubar-x.c}
13123 @file{menubar.c}
13124 @file{menubar.h}
13125 @end example
13126
13127
13128
13129 @example
13130 @file{scrollbar-msw.c}
13131 @file{scrollbar-msw.h}
13132 @file{scrollbar-x.c}
13133 @file{scrollbar-x.h}
13134 @file{scrollbar.c}
13135 @file{scrollbar.h}
13136 @end example
13137
13138
13139
13140 @example
13141 @file{toolbar-msw.c}
13142 @file{toolbar-x.c}
13143 @file{toolbar.c}
13144 @file{toolbar.h}
13145 @end example
13146
13147
13148
13149 @example
13150 @file{font-lock.c}
13151 @end example
13152
13153 This file provides C support for syntax highlighting---i.e.
13154 highlighting different syntactic constructs of a source file in
13155 different colors, for easy reading. The C support is provided so that
13156 this is fast.
13157
13158
13159
13160 @example
13161 @file{dgif_lib.c}
13162 @file{gif_err.c}
13163 @file{gif_lib.h}
13164 @file{gifalloc.c}
13165 @end example
13166
13167 These modules decode GIF-format image files, for use with glyphs.
13168 These files were removed due to Unisys patent infringement concerns.
13169
13170
13171 @node Extents, Faces, The Redisplay Mechanism, Top
13172 @chapter Extents
13173 @cindex extents
13174
13175 @menu
13176 * Introduction to Extents:: Extents are ranges over text, with properties.
13177 * Extent Ordering:: How extents are ordered internally.
13178 * Format of the Extent Info:: The extent information in a buffer or string.
13179 * Zero-Length Extents:: A weird special case.
13180 * Mathematics of Extent Ordering:: A rigorous foundation.
13181 * Extent Fragments:: Cached information useful for redisplay.
13182 @end menu
13183
13184 @node Introduction to Extents, Extent Ordering, Extents, Extents
13185 @section Introduction to Extents
13186 @cindex extents, introduction to
13187
13188 Extents are regions over a buffer, with a start and an end position
13189 denoting the region of the buffer included in the extent. In
13190 addition, either end can be closed or open, meaning that the endpoint
13191 is or is not logically included in the extent. Insertion of a character
13192 at a closed endpoint causes the character to go inside the extent;
13193 insertion at an open endpoint causes the character to go outside.
13194
13195 Extent endpoints are stored using memory indices (see @file{insdel.c}),
13196 to minimize the amount of adjusting that needs to be done when
13197 characters are inserted or deleted.
13198
13199 (Formerly, extent endpoints at the gap could be either before or
13200 after the gap, depending on the open/closedness of the endpoint.
13201 The intent of this was to make it so that insertions would
13202 automatically go inside or out of extents as necessary with no
13203 further work needing to be done. It didn't work out that way,
13204 however, and just ended up complexifying and buggifying all the
13205 rest of the code.)
13206
13207 @node Extent Ordering, Format of the Extent Info, Introduction to Extents, Extents
13208 @section Extent Ordering
13209 @cindex extent ordering
13210
13211 Extents are compared using memory indices. There are two orderings
13212 for extents and both orders are kept current at all times. The normal
13213 or @dfn{display} order is as follows:
13214
13215 @example
13216 Extent A is ``less than'' extent B,
13217 that is, earlier in the display order,
13218 if: A-start < B-start,
13219 or if: A-start = B-start, and A-end > B-end
13220 @end example
13221
13222 So if two extents begin at the same position, the larger of them is the
13223 earlier one in the display order (@code{EXTENT_LESS} is true).
13224
13225 For the e-order, the same thing holds:
13226
13227 @example
13228 Extent A is ``less than'' extent B in e-order,
13229 that is, earlier in the e-order,
13230 if: A-end < B-end,
13231 or if: A-end = B-end, and A-start > B-start
13232 @end example
13233
13234 So if two extents end at the same position, the smaller of them is the
13235 earlier one in the e-order (@code{EXTENT_E_LESS} is true).
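
As a minimal sketch of the two orderings just described (not the actual
@code{EXTENT_LESS} and @code{EXTENT_E_LESS} macros), the comparisons can
be written as below. The @code{toy_extent} structure and both function
names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified extent: only the two memory indices matter. */
struct toy_extent
{
  long start;
  long end;
};

/* Display order: earlier start first; on equal starts, the longer
   extent (larger end) comes first. */
bool
toy_extent_less (const struct toy_extent *a, const struct toy_extent *b)
{
  assert (a->start <= a->end && b->start <= b->end);
  return a->start < b->start
    || (a->start == b->start && a->end > b->end);
}

/* E-order: earlier end first; on equal ends, the shorter extent
   (larger start) comes first. */
bool
toy_extent_e_less (const struct toy_extent *a, const struct toy_extent *b)
{
  assert (a->start <= a->end && b->start <= b->end);
  return a->end < b->end
    || (a->end == b->end && a->start > b->start);
}
```

With extents spanning 5..20 and 5..10, the first function puts the
larger one first; with extents spanning 8..10 and 5..10, the second
function puts the smaller one first.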
13236
13237 The display order and the e-order are complementary orders: any
13238 theorem about the display order also applies to the e-order if you swap
13239 all occurrences of ``display order'' and ``e-order'', ``less than'' and
13240 ``greater than'', and ``extent start'' and ``extent end''.
13241
13242 @node Format of the Extent Info, Zero-Length Extents, Extent Ordering, Extents
13243 @section Format of the Extent Info
13244 @cindex extent info, format of the
13245
13246 An extent-info structure consists of a list of the buffer or string's
13247 extents and a @dfn{stack of extents} that lists all of the extents over
13248 a particular position. The stack-of-extents info is used for
13249 optimization purposes---it basically caches some info that might
13250 be expensive to compute. Certain otherwise hard computations are easy
13251 given the stack of extents over a particular position, and if the
13252 stack of extents over a nearby position is known (because it was
13253 calculated at some prior point in time), it's easy to move the stack
13254 of extents to the proper position.
13255
13256 Given that the stack of extents is an optimization, and given that
13257 it requires memory, a string's stack of extents is wiped out each
13258 time a garbage collection occurs. Therefore, any time you retrieve
13259 the stack of extents, it might not be there. If you need it to
13260 be there, use the @code{_force} version.
13261
13262 Similarly, a string may or may not have an extent_info structure.
13263 (Generally it won't if there haven't been any extents added to the
13264 string.) So use the @code{_force} version if you need the extent_info
13265 structure to be there.
13266
13267 A list of extents is maintained as a double gap array. One gap array
13268 is ordered by start index (the @dfn{display order}) and the other is
13269 ordered by end index (the @dfn{e-order}). Note that positions in an
13270 extent list should logically be conceived of as referring @emph{to} a
13271 particular extent (as is the norm in programs) rather than sitting
13272 between two extents. Note also that callers of these functions should
13273 not be aware of the fact that the extent list is implemented as an
13274 array, except for the fact that positions are integers (this should be
13275 generalized to handle integers and linked lists equally well).
13276
13277 A gap array is the same structure used by buffer text: an array of
13278 elements with a ``gap'' somewhere in the middle. Insertion and deletion
13279 happen by moving the gap to the insertion/deletion point, and then
13280 expanding/contracting as necessary. Gap arrays have a number of
13281 useful properties:
13282
13283 @enumerate
13284 @item
13285 They are space efficient, as there is no need for next/previous pointers.
13286
13287 @item
13288 If the items in them are sorted, locating an item is fast -- @math{O(log N)}.
13289
13290 @item
13291 Insertion and deletion are very fast (constant time, essentially) if the
13292 gap is near (which favors localized operations, as will usually be the
13293 case). Even if not, it requires only a block move of memory, which is
13294 generally a highly optimized operation on modern processors.
13295
13296 @item
13297 Code to manipulate them is relatively simple to write.
13298 @end enumerate
13299
13300 An alternative would be balanced binary trees, which have guaranteed
13301 @math{O(log N)} time for all operations (although the constant factors
13302 are not as good, and repeated localized operations will be slower than
13303 for a gap array). Such code is quite tricky to write, however.
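
To make the gap-array idea concrete, here is a small sketch of one that
holds plain integers. It is not the XEmacs gap-array implementation:
the @code{toy_gap_array} structure, its fields, and the three functions
are assumptions made for this example. It shows the two points made
above: an insertion at the gap touches no other element, and an
insertion elsewhere costs one block move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal gap array of ints. */
struct toy_gap_array
{
  int *data;        /* storage: elements, then the gap, then elements */
  size_t gap;       /* physical index where the gap begins */
  size_t gap_size;  /* number of unused slots in the gap */
  size_t count;     /* number of live elements */
};

/* Move the gap so that it begins at logical position POS. */
void
toy_gap_move (struct toy_gap_array *ga, size_t pos)
{
  assert (pos <= ga->count);
  if (pos < ga->gap)
    /* Shift elements [pos, gap) up to the far side of the gap. */
    memmove (ga->data + pos + ga->gap_size, ga->data + pos,
             (ga->gap - pos) * sizeof (int));
  else if (pos > ga->gap)
    /* Shift the elements just after the gap down into it. */
    memmove (ga->data + ga->gap, ga->data + ga->gap + ga->gap_size,
             (pos - ga->gap) * sizeof (int));
  ga->gap = pos;
}

/* Insert VALUE at logical position POS, growing the storage when the
   gap is exhausted.  Returns 0 on success, -1 if allocation fails. */
int
toy_gap_insert (struct toy_gap_array *ga, size_t pos, int value)
{
  assert (pos <= ga->count);
  if (ga->gap_size == 0)
    {
      size_t grow = ga->count + 8;
      int *bigger = realloc (ga->data, (ga->count + grow) * sizeof (int));
      if (!bigger)
        return -1;
      ga->data = bigger;
      /* With an empty gap all elements are contiguous, so a fresh gap
         can simply be opened at the end. */
      ga->gap = ga->count;
      ga->gap_size = grow;
    }
  toy_gap_move (ga, pos);
  ga->data[ga->gap++] = value;
  ga->gap_size--;
  ga->count++;
  return 0;
}

/* Fetch the element at logical position POS, skipping over the gap. */
int
toy_gap_get (const struct toy_gap_array *ga, size_t pos)
{
  assert (pos < ga->count);
  return ga->data[pos < ga->gap ? pos : pos + ga->gap_size];
}
```

A zero-initialized @code{struct toy_gap_array} is a valid empty array;
the caller releases @code{data} with @code{free} when done.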
13304
13305 @node Zero-Length Extents, Mathematics of Extent Ordering, Format of the Extent Info, Extents
13306 @section Zero-Length Extents
13307 @cindex zero-length extents
13308 @cindex extents, zero-length
13309
13310 Extents can be zero-length, and will end up that way if their endpoints
13311 are explicitly set that way or if their detachable property is @code{nil}
13312 and all the text in the extent is deleted. (The exception is open-open
13313 zero-length extents, which are barred from existing because there is
13314 no sensible way to define their properties. Deletion of the text in
13315 an open-open extent causes it to be converted into a closed-open
13316 extent.) Zero-length extents are primarily used to represent
13317 annotations, and behave as follows:
13318
13319 @enumerate
13320 @item
13321 Insertion at the position of a zero-length extent expands the extent
13322 if both endpoints are closed; goes after the extent if it is closed-open;
13323 and goes before the extent if it is open-closed.
13324
13325 @item
13326 Deletion of a character on a side of a zero-length extent whose
13327 corresponding endpoint is closed causes the extent to be detached if
13328 it is detachable; if the extent is not detachable or the corresponding
13329 endpoint is open, the extent remains in the buffer, moving as necessary.
13330 @end enumerate
13331
13332 Note that closed-open, non-detachable zero-length extents behave
13333 exactly like markers and that open-closed, non-detachable zero-length
13334 extents behave like the ``point-type'' marker in Mule.
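
The insertion rule in item 1 can be summarized by a small decision
function. This is only a sketch of the rule as stated here; the
enumeration and the function name are hypothetical and do not appear in
the XEmacs sources.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_insert_result
{
  TOY_INSIDE_EXTENT,  /* the extent expands to cover the new text */
  TOY_AFTER_EXTENT,   /* the new text lands after the extent */
  TOY_BEFORE_EXTENT   /* the new text lands before the extent */
};

/* Where does text inserted at the position of a zero-length extent go,
   given which of the extent's endpoints are open? */
enum toy_insert_result
toy_zero_length_insert (bool start_open, bool end_open)
{
  /* Open-open zero-length extents are barred from existing. */
  assert (!(start_open && end_open));
  if (!start_open && !end_open)
    return TOY_INSIDE_EXTENT;   /* closed-closed */
  if (end_open)
    return TOY_AFTER_EXTENT;    /* closed-open */
  return TOY_BEFORE_EXTENT;     /* open-closed */
}
```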
13335
13336 @node Mathematics of Extent Ordering, Extent Fragments, Zero-Length Extents, Extents
13337 @section Mathematics of Extent Ordering
13338 @cindex mathematics of extent ordering
13339 @cindex extent mathematics
13340 @cindex extent ordering
13341
13342 @cindex display order of extents
13343 @cindex extents, display order
13344 The extents in a buffer are ordered by ``display order'' because that
13345 is the order in which the redisplay mechanism needs to process them.
13346 The e-order is an auxiliary ordering used to facilitate operations
13347 over extents. The operations that can be performed on the ordered
13348 list of extents in a buffer are
13349
13350 @enumerate
13351 @item
13352 Locate where an extent would go if inserted into the list.
13353 @item
13354 Insert an extent into the list.
13355 @item
13356 Remove an extent from the list.
13357 @item
13358 Map over all the extents that overlap a range.
13359 @end enumerate
13360
13361 (4) requires being able to determine the first and last extents
13362 that overlap a range.
13363
13364 NOTE: @dfn{overlap} is used as follows:
13365
13366 @itemize @bullet
13367 @item
13368 two ranges overlap if they have at least one point in common.
13369 Whether the endpoints are open or closed makes a difference here.
13370 @item
13371 a point overlaps a range if the point is contained within the
13372 range; this is equivalent to treating a point @math{P} as the range
13373 @math{[P, P]}.
13374 @item
13375 In the case of an @emph{extent} overlapping a point or range, the extent
13376 is normally treated as having closed endpoints. This applies
13377 consistently in the discussion of stacks of extents and such below.
13378 Note that this definition of overlap is not necessarily consistent with
13379 the extents that @code{map-extents} maps over, since @code{map-extents}
13380 sometimes pays attention to whether the endpoints of an extent are open
13381 or closed. But for our purposes, it greatly simplifies things to treat
13382 all extents as having closed endpoints.
13383 @end itemize
13384
13385 First, define @math{>}, @math{<}, @math{<=}, etc. as applied to extents
13386 to mean comparison according to the display order. Comparison between
13387 an extent @math{E} and an index @math{I} means comparison between
13388 @math{E} and the range @math{[I, I]}.
13389
13390 Also define @math{e>}, @math{e<}, @math{e<=}, etc. to mean comparison
13391 according to the e-order.
13392
13393 For any range @math{R}, define @math{R(0)} to be the starting index of
13394 the range and @math{R(1)} to be the ending index of the range.
13395
13396 For any extent @math{E}, define @math{E(next)} to be the extent directly
13397 following @math{E}, and @math{E(prev)} to be the extent directly
13398 preceding @math{E}. Assume @math{E(next)} and @math{E(prev)} can be
13399 determined from @math{E} in constant time. (With the gap-array
13400 representation, this holds once the position of @math{E} is known.)
13401
13402 Similarly, define @math{E(e-next)} and @math{E(e-prev)} to be the
13403 extents directly following and preceding @math{E} in the e-order.
13404
13405 Now:
13406
13407 Let @math{R} be a range.
13408 Let @math{F} be the first extent overlapping @math{R}.
13409 Let @math{L} be the last extent overlapping @math{R}.
13410
13411 Theorem 1: @math{R(1)} lies between @math{L} and @math{L(next)},
13412 i.e. @math{L <= R(1) < L(next)}.
13413
13414 This follows easily from the definition of display order. The
13415 basic reason that this theorem applies is that the display order
13416 sorts by increasing starting index.
13417
13418 Therefore, we can determine @math{L} just by looking at where we would
13419 insert @math{R(1)} into the list, and if we know @math{F} and are moving
13420 forward over extents, we can easily determine when we've hit @math{L} by
13421 comparing the extent we're at to @math{R(1)}.
13422
13423 @example
13424 Theorem 2: @math{F(e-prev) e< [1, R(0)] e<= F}.
13425 @end example
13426
13427 This is the analog of Theorem 1, and applies because the e-order
13428 sorts by increasing ending index.
13429
13430 Therefore, @math{F} can be found in the same amount of time as
13431 operation (1), i.e. the time that it takes to locate where an extent
13432 would go if inserted into the e-order list. This is @math{O(log N)},
13433 since we are using gap arrays to manage extents.
13434
13435 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents
13436 (ordered in display order and e-order, just like for normal extent
13437 lists) that overlap an index @math{I}.
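
As an illustration of this definition only, the following sketch
collects the stack of extents for an index by a linear scan over a list
that is already in display order. The real code keeps the stack cached
and moves it incrementally; the @code{soe_extent} structure and the
function name here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified extent with closed endpoints. */
struct soe_extent
{
  long start;
  long end;
};

/* Store in OUT (capacity CAP) pointers to the extents of LIST that
   overlap INDEX, and return how many were stored.  LIST must be in
   display order, so OUT comes out in display order too. */
size_t
soe_collect (const struct soe_extent *list, size_t n, long index,
             const struct soe_extent **out, size_t cap)
{
  size_t found = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      assert (list[i].start <= list[i].end);
      /* Display order sorts by increasing start, so once an extent
         starts past INDEX no later extent can overlap it. */
      if (list[i].start > index)
        break;
      if (list[i].end >= index && found < cap)
        out[found++] = &list[i];
    }
  return found;
}
```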
13438
13439 Now:
13440
13441 Let @math{I} be an index, let @math{S} be the stack of extents on
13442 @math{I} and let @math{F} be the first extent in @math{S}.
13443
13444 Theorem 3: The first extent in @math{S} is the first extent that overlaps
13445 any range @math{[I, J]}.
13446
13447 Proof: Any extent that overlaps @math{[I, J]} but does not include
13448 @math{I} must have a start index @math{> I}, and thus be greater than
13449 any extent in @math{S}.
13450
13451 Therefore, finding the first extent that overlaps a range @math{R} is
13452 the same as finding the first extent that overlaps @math{R(0)}.
13453
13454 Theorem 4: Let @math{I2} be an index such that @math{I2 > I}, and let
13455 @math{F2} be the first extent that overlaps @math{I2}. Then, either
13456 @math{F2} is in @math{S} or @math{F2} is greater than any extent in
13457 @math{S}.
13458
13459 Proof: If @math{F2} does not include @math{I} then its start index is
13460 greater than @math{I} and thus it is greater than any extent in
13461 @math{S}, including @math{F}. Otherwise, @math{F2} includes @math{I}
13462 and thus is in @math{S}, and thus @math{F2 >= F}.
13463
13464 @node Extent Fragments, , Mathematics of Extent Ordering, Extents
13465 @section Extent Fragments
13466 @cindex extent fragments
13467 @cindex fragments, extent
13468
13469 Imagine that the buffer is divided up into contiguous, non-overlapping
13470 @dfn{runs} of text such that no extent starts or ends within a run
13471 (extents that abut the run don't count).
13472
13473 An extent fragment is a structure that holds data about the run that
13474 contains a particular buffer position (if the buffer position is at the
13475 junction of two runs, the run after the position is used)---the
13476 beginning and end of the run, a list of all of the extents in that run,
13477 the @dfn{merged face} that results from merging all of the faces
13478 corresponding to those extents, the begin and end glyphs at the
13479 beginning of the run, etc. This is the information that redisplay needs
13480 in order to display this run.
13481
13482 Extent fragments have to be very quick to update to a new buffer
13483 position when moving linearly through the buffer. They rely on the
13484 stack-of-extents code, which does the heavy-duty algorithmic work of
13485 determining which extents overlie a particular position.
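
The notion of a run can be sketched as follows: the run containing a
position begins at the nearest extent endpoint at or before the position
and ends at the nearest endpoint after it. This is a brute-force
illustration under that reading, not the extent-fragment code itself;
the structures and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct frag_extent
{
  long start;
  long end;
};

struct frag_run
{
  long start;
  long end;
};

/* Find the run containing POS in a buffer spanning BUF_START..BUF_END.
   Run boundaries fall on extent endpoints; a position sitting exactly
   on a boundary belongs to the run after it. */
struct frag_run
frag_run_at (const struct frag_extent *extents, size_t n,
             long buf_start, long buf_end, long pos)
{
  struct frag_run run;
  size_t i;
  int j;

  assert (buf_start <= pos && pos < buf_end);
  run.start = buf_start;
  run.end = buf_end;
  for (i = 0; i < n; i++)
    {
      long endpoints[2];
      endpoints[0] = extents[i].start;
      endpoints[1] = extents[i].end;
      for (j = 0; j < 2; j++)
        {
          long p = endpoints[j];
          if (p <= pos && p > run.start)
            run.start = p;   /* closest boundary at or before POS */
          if (p > pos && p < run.end)
            run.end = p;     /* closest boundary after POS */
        }
    }
  return run;
}
```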
13486
13487 @node Faces, Glyphs, Extents, Top
13488 @chapter Faces
13489 @cindex faces
13490
13491 Not yet documented.
13492
13493 @node Glyphs, Specifiers, Faces, Top
13494 @chapter Glyphs
13495 @cindex glyphs
13496
13497 Glyphs are graphical elements that can be displayed in XEmacs buffers or
13498 gutters. We use the term graphical element here in the broadest possible
13499 sense since glyphs can be as mundane as text or as arcane as a native
13500 tab widget.
13501
13502 In XEmacs, glyphs represent the uninstantiated state of graphical
13503 elements, i.e. they hold all the information necessary to produce an
13504 image on-screen but the image need not exist at this stage, and multiple
13505 screen images can be instantiated from a single glyph.
13506
13507 @c #### find a place for this discussion
13508 @c The decision to make image specifiers a separate type is debatable.
13509 @c In fact, the design decision to create a separate image specifier
13510 @c type, rather than make glyphs themselves be specifiers, is
13511 @c debatable---the other properties of glyphs are rarely used and could
13512 @c conceivably have been incorporated into the glyph's instantiator.
13513 @c The rarely used glyph types (buffer, pointer, icon) could also have
13514 @c been incorporated into the instantiator.
13515
13516 Glyphs are lazily instantiated by calling one of the glyph
13517 functions. This usually occurs within redisplay when
13518 @code{Fglyph_height} is called. Instantiation causes an image-instance
13519 to be created and cached. This cache is on a per-device basis for all glyphs
13520 except widget-glyphs, and on a per-window basis for widget-glyphs. The
13521 caching is done by @code{image_instantiate} and is necessary because it
13522 is generally possible to display an image-instance in multiple
13523 domains. For instance if we create a Pixmap, we can actually display
13524 this on multiple windows - even though we only need a single Pixmap
13525 instance to do this. If caching wasn't done then it would be necessary
13526 to create image-instances for every displayable occurrence of a glyph -
13527 and every usage - and this would be extremely memory and cpu intensive.
13528
13529 Widget-glyphs (also known as native widgets) are not cached in this way. This is
13530 because widget-glyph image-instances on screen are toolkit windows, and
13531 thus cannot be reused in multiple XEmacs domains. Thus widget-glyphs are
13532 cached on an XEmacs window basis.
13533
13534 Any action on a glyph first consults the cache before actually
13535 instantiating a widget.
13536
13537 @section Glyph Instantiation
13538 @cindex glyph instantiation
13539 @cindex instantiation, glyph
13540
13541 Glyph instantiation is a hairy topic and requires some explanation. The
13542 guts of glyph instantiation are contained within
13543 @code{image_instantiate}. A glyph contains an image which is a
13544 specifier. When a glyph function - for instance @code{Fglyph_height} -
13545 asks for a property of the glyph that can only be determined from its
13546 instantiated state, then the glyph image is instantiated and an image
13547 instance created. The instantiation process is governed by the specifier
13548 code and goes through a series of steps:
13549
13550 @itemize @bullet
13551 @item
13552 Validation. Instantiation of image instances happens dynamically - often
13553 within the guts of redisplay. Thus it is often not feasible to catch
13554 instantiator errors at instantiation time. Instead the instantiator is
13555 validated at the time it is added to the image specifier. This function
13556 is defined by @code{image_validate} and at a simple level validates
13557 keyword value pairs.
13558 @item
13559 Duplication. The specifier code by default takes a copy of the
13560 instantiator. This is reasonable for most specifiers but in the case of
13561 widget-glyphs can be problematic, since some of the properties in the
13562 instantiator - for instance callbacks - could cause infinite recursion
13563 in the copying process. Thus the image code defines a function -
13564 @code{image_copy_instantiator} - which will selectively copy values.
13565 This is controlled by the way that a keyword is defined either using
13566 @code{IIFORMAT_VALID_KEYWORD} or
13567 @code{IIFORMAT_VALID_NONCOPY_KEYWORD}. Note that the image caching and
13568 redisplay code relies on instantiator copying to ensure that current and
13569 new instantiators are actually different rather than referring to the
13570 same thing.
13571 @item
13572 Normalization. Once the instantiator has been copied it must be
13573 converted into a form that is viable at instantiation time. This can
13574 involve no changes at all, but typically involves things like converting
13575 file names to the actual data. This function is defined by
13576 @code{image_going_to_add} and @code{normalize_image_instantiator}.
13577 @item
13578 Instantiation. When an image instance is actually required for display
13579 it is instantiated using @code{image_instantiate}. This involves calling
13580 instantiate methods that are specific to the type of image being
13581 instantiated.
13582 @end itemize
13583
13584 The final instantiation phase also involves a number of steps. In order
13585 to understand these we need to describe a number of concepts.
13586
13587 An image is instantiated in a @dfn{domain}, where a domain can be any
13588 one of a device, frame, window or image-instance. The domain gives the
13589 image-instance context and identity, and properties that affect the
13590 appearance of the image-instance may be different for the same glyph
13591 instantiated in different domains. An example is the face used to
13592 display the image-instance.
13593
13594 Although an image is instantiated in a particular domain the
13595 instantiation domain is not necessarily the domain in which the
13596 image-instance is cached. For example a pixmap can be instantiated in a
13597 window but actually be cached on a per-device basis. The domain in which
13598 the image-instance is actually cached is called the
13599 @dfn{governing-domain}. A governing-domain is currently either a device
13600 or a window. Widget-glyphs and text-glyphs have a window as a
13601 governing-domain, all other image-instances have a device as the
13602 governing-domain. The governing domain for an image-instance is
13603 determined using the @code{governing_domain} image-instance method.
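
The rule for choosing a governing domain can be sketched as below. This
is an illustration of the rule as stated in this section, not the
@code{governing_domain} method itself: the enumerations, the
@code{toy_window} structure, and the cache-key helpers are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_image_kind { TOY_IMAGE_TEXT, TOY_IMAGE_PIXMAP, TOY_IMAGE_WIDGET };
enum toy_domain_kind { TOY_DOMAIN_DEVICE, TOY_DOMAIN_WINDOW };

/* A window, identified together with the device it is displayed on. */
struct toy_window
{
  int device_id;
  int window_id;
};

/* What an image-instance cache would be keyed on. */
struct toy_cache_key
{
  enum toy_domain_kind kind;
  int id;
};

/* Widget-glyphs and text-glyphs are governed by a window; every other
   kind of image-instance is governed by a device. */
enum toy_domain_kind
toy_governing_domain (enum toy_image_kind kind)
{
  switch (kind)
    {
    case TOY_IMAGE_TEXT:
    case TOY_IMAGE_WIDGET:
      return TOY_DOMAIN_WINDOW;
    default:
      return TOY_DOMAIN_DEVICE;
    }
}

/* Build the cache key for an image of KIND instantiated in window W. */
struct toy_cache_key
toy_cache_key_for (enum toy_image_kind kind, const struct toy_window *w)
{
  struct toy_cache_key key;

  assert (w != NULL);
  key.kind = toy_governing_domain (kind);
  key.id = key.kind == TOY_DOMAIN_WINDOW ? w->window_id : w->device_id;
  return key;
}

bool
toy_cache_key_equal (struct toy_cache_key a, struct toy_cache_key b)
{
  return a.kind == b.kind && a.id == b.id;
}
```

Under these assumptions, a pixmap instantiated in two windows of the
same device yields equal keys and so shares one cached instance, while a
widget instantiated in the same two windows yields distinct keys.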
13604
13605 @section Widget-Glyphs
13606 @cindex widget-glyphs
13607
13608 @section Widget-Glyphs in the MS-Windows Environment
13609 @cindex widget-glyphs in the MS-Windows environment
13610 @cindex MS-Windows environment, widget-glyphs in the
13611
13612 To Do
13613
13614 @section Widget-Glyphs in the X Environment
13615 @cindex widget-glyphs in the X environment
13616 @cindex X environment, widget-glyphs in the
13617
13618 Widget-glyphs under X make heavy use of lwlib (@pxref{Lucid Widget
13619 Library}) for manipulating the native toolkit objects. This is primarily
13620 so that different toolkits can be supported for widget-glyphs, just as
13621 they are supported for features such as menubars etc.
13622
13623 Lwlib is extremely poorly documented and quite hairy so here is my
13624 understanding of what goes on.
13625
13626 Lwlib maintains a set of widget_instances which mirror the hierarchical
13627 state of Xt widgets. I think this is so that widgets can be updated and
13628 manipulated generically by the lwlib library. For instance
13629 update_one_widget_instance can cope with multiple types of widget and
13630 multiple types of toolkit. Each element in the widget hierarchy is updated
13631 from its corresponding widget_instance by walking the widget_instance
13632 tree recursively.
13633
13634 This has desirable properties such as lw_modify_all_widgets which is
13635 called from @file{glyphs-x.c} and updates all the properties of a widget
13636 without having to know what the widget is or what toolkit it is from.
13637 Unfortunately this also has hairy properties such as making the lwlib
13638 code quite complex. And of course lwlib has to know at some level what
13639 the widget is and how to set its properties.
13640
13641 @node Specifiers, Menus, Glyphs, Top
13642 @chapter Specifiers
13643 @cindex specifiers
13644
13645 Not yet documented.
13646
13647 Specifiers are documented in depth in the Lisp Reference manual.
13648 @xref{Specifiers,,, lispref, XEmacs Lisp Reference Manual}. The code in
13649 @file{specifier.c} is pretty straightforward.
13650
13651 @node Menus, Events and the Event Loop, Specifiers, Top
13652 @chapter Menus
13653 @cindex menus
13654
13655 A menu is set by setting the value of the variable
13656 @code{current-menubar} (which may be buffer-local) and then calling
13657 @code{set-menubar-dirty-flag} to signal a change. This will cause the
13658 menu to be redrawn at the next redisplay. The format of the data in
13659 @code{current-menubar} is described in @file{menubar.c}.
13660
13661 Internally the data in @code{current-menubar} is parsed into a tree of
13662 @code{widget_value}s (defined in @file{lwlib.h}); this is accomplished
13663 by the recursive function @code{menu_item_descriptor_to_widget_value()},
13664 called by @code{compute_menubar_data()}. Such a tree is deallocated
13665 using @code{free_widget_value()}.
13666
13667 @code{update_screen_menubars()} is one of the external entry points.
13668 This checks to see, for each screen, if that screen's menubar needs to
13669 be updated. This is the case if
13670
13671 @enumerate
13672 @item
13673 @code{set-menubar-dirty-flag} was called since the last redisplay. (This
13674 function sets the C variable @code{menubar_has_changed}.)
13675 @item
13676 The buffer displayed in the screen has changed.
13677 @item
13678 The screen has no menubar currently displayed.
13679 @end enumerate
13680
13681 @code{set_screen_menubar()} is called for each such screen. This
13682 function calls @code{compute_menubar_data()} to create the tree of
13683 @code{widget_value}s, then calls @code{lw_create_widget()},
13684 @code{lw_modify_all_widgets()}, and/or @code{lw_destroy_all_widgets()}
13685 to create the X-Toolkit widget associated with the menu.
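
The three update conditions listed above amount to a simple predicate.
The sketch below is an assumption about how such a test could look, not
the code in @file{menubar.c}: the @code{toy_screen} structure and its
fields are hypothetical, and only the name @code{menubar_has_changed}
comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-screen state relevant to the menubar. */
struct toy_screen
{
  bool has_menubar;             /* is a menubar currently displayed? */
  const void *displayed_buffer; /* buffer now shown in the screen */
  const void *menubar_buffer;   /* buffer the menubar was built for */
};

/* Does this screen's menubar need to be rebuilt? */
bool
toy_menubar_needs_update (const struct toy_screen *s,
                          bool menubar_has_changed)
{
  assert (s != NULL);
  return menubar_has_changed                         /* condition 1 */
    || s->displayed_buffer != s->menubar_buffer      /* condition 2 */
    || !s->has_menubar;                              /* condition 3 */
}
```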
13686
13687 @code{update_psheets()}, the other external entry point, actually
13688 changes the menus being displayed. It uses the widgets fixed by
13689 @code{update_screen_menubars()} and calls various X functions to ensure
13690 that the menus are displayed properly.
13691
13692 The menubar widget is set up so that @code{pre_activate_callback()} is
13693 called when the menu is first selected (i.e. mouse button goes down),
13694 and @code{menubar_selection_callback()} is called when an item is
13695 selected. @code{pre_activate_callback()} calls the function in
13696 @code{activate-menubar-hook}, which can change the menubar (this is described
13697 in @file{menubar.c}). If the menubar is changed,
13698 @code{set_screen_menubars()} is called.
13699 @code{menubar_selection_callback()} enqueues a menu event, putting in it
13700 a function to call (either @code{eval} or @code{call-interactively}) and
13701 its argument, which is the callback function or form given in the menu's
13702 description.
13703
13704 @node Events and the Event Loop, Asynchronous Events; Quit Checking, Menus, Top
7475 @chapter Events and the Event Loop 13705 @chapter Events and the Event Loop
7476 @cindex events and the event loop 13706 @cindex events and the event loop
7477 @cindex event loop, events and the 13707 @cindex event loop, events and the
7478 13708
7479 @menu 13709 @menu
8284 the only code remaining is code to call out to Lisp or provide simple 14514 the only code remaining is code to call out to Lisp or provide simple
8285 bootstrapping implementations early in temacs, before the echo-area Lisp 14515 bootstrapping implementations early in temacs, before the echo-area Lisp
8286 code is loaded). 14516 code is loaded).
8287 14517
8288 14518
8289 @node Asynchronous Events; Quit Checking, Evaluation; Stack Frames; Bindings, Events and the Event Loop, Top 14519 @node Asynchronous Events; Quit Checking, Lstreams, Events and the Event Loop, Top
8290 @chapter Asynchronous Events; Quit Checking 14520 @chapter Asynchronous Events; Quit Checking
8291 @cindex asynchronous events; quit checking 14521 @cindex asynchronous events; quit checking
8292 @cindex asynchronous events 14522 @cindex asynchronous events
8293 14523
8294 @menu 14524 @menu
8610 @item 14840 @item
8611 printing code does not do code conversion or gettext when 14841 printing code does not do code conversion or gettext when
8612 printing to stdout/stderr. 14842 printing to stdout/stderr.
8613 @end itemize 14843 @end itemize
8614 14844
8615 @node Evaluation; Stack Frames; Bindings, Symbols and Variables, Asynchronous Events; Quit Checking, Top 14845 @node Lstreams, Subprocesses, Asynchronous Events; Quit Checking, Top
8616 @chapter Evaluation; Stack Frames; Bindings
8617 @cindex evaluation; stack frames; bindings
8618 @cindex stack frames; bindings, evaluation;
8619 @cindex bindings, evaluation; stack frames;
8620
8621 @menu
8622 * Evaluation::
8623 * Dynamic Binding; The specbinding Stack; Unwind-Protects::
8624 * Simple Special Forms::
8625 * Catch and Throw::
8626 @end menu
8627
8628 @node Evaluation, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings, Evaluation; Stack Frames; Bindings
8629 @section Evaluation
8630 @cindex evaluation
8631
8632 @code{Feval()} evaluates the form (a Lisp object) that is passed to
8633 it. Note that evaluation is only non-trivial for two types of objects:
8634 symbols and conses. A symbol is evaluated simply by calling
8635 @code{symbol-value} on it and returning the value.
8636
8637 Evaluating a cons means calling a function. First, @code{eval} checks
8638 to see if garbage-collection is necessary, and calls
8639 @code{garbage_collect_1()} if so. It then increases the evaluation
8640 depth by 1 (@code{lisp_eval_depth}, which is always less than
8641 @code{max_lisp_eval_depth}) and adds an element to the linked list of
8642 @code{struct backtrace}'s (@code{backtrace_list}). Each such structure
8643 contains a pointer to the function being called plus a list of the
8644 function's arguments. Originally these values are stored unevalled, and
8645 as they are evaluated, the backtrace structure is updated. Garbage
8646 collection pays attention to the objects pointed to in the backtrace
8647 structures (garbage collection might happen while a function is being
8648 called or while an argument is being evaluated, and there could easily
8649 be no other references to the arguments in the argument list; once an
8650 argument is evaluated, however, the unevalled version is not needed by
8651 eval, and so the backtrace structure is changed).
8652
8653 At this point, the function to be called is determined by looking at
8654 the car of the cons (if this is a symbol, its function definition is
8655 retrieved and the process repeated). The function should then consist
8656 of either a @code{Lisp_Subr} (built-in function written in C), a
8657 @code{Lisp_Compiled_Function} object, or a cons whose car is one of the
8658 symbols @code{autoload}, @code{macro} or @code{lambda}.
8659
8660 If the function is a @code{Lisp_Subr}, the Lisp object points to a
8661 @code{struct Lisp_Subr} (created by @code{DEFUN()}), which contains a
8662 pointer to the C function, a minimum and maximum number of arguments
8663 (or possibly the special constants @code{MANY} or @code{UNEVALLED}), a
8664 pointer to the symbol referring to that subr, and a couple of other
8665 things. If the subr wants its arguments @code{UNEVALLED}, they are
8666 passed raw as a list. Otherwise, an array of evaluated arguments is
8667 created and put into the backtrace structure, and either passed whole
8668 (@code{MANY}) or each argument is passed as a C argument.
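
The difference between passing the argument array whole and passing each
argument as a C argument can be sketched as follows. This is a toy
model, not @code{struct Lisp_Subr}: the @code{toy_} names, the value
chosen for @code{TOY_MANY}, the limit of two fixed arguments, and the
padding of omitted optional arguments with a nil stand-in are all
assumptions. The @code{UNEVALLED} case is left out because the sketch
has no list type.

```c
#include <assert.h>

typedef long toy_obj;          /* stand-in for a Lisp object */

#define TOY_NIL 0L             /* stand-in for nil */
#define TOY_MANY (-2)          /* assumed marker value */

struct toy_subr
{
  int min_args;
  int max_args;                /* 0, 1, 2, or TOY_MANY in this sketch */
  union
  {
    toy_obj (*a0) (void);
    toy_obj (*a1) (toy_obj);
    toy_obj (*a2) (toy_obj, toy_obj);
    toy_obj (*many) (int nargs, const toy_obj *args);
  } fn;
};

/* Call SUBR on NARGS already-evaluated arguments. */
toy_obj
toy_call_subr (const struct toy_subr *subr, int nargs, const toy_obj *args)
{
  toy_obj padded[2] = { TOY_NIL, TOY_NIL };
  int i;

  assert (nargs >= subr->min_args);
  if (subr->max_args == TOY_MANY)
    return subr->fn.many (nargs, args);   /* array passed whole */

  assert (subr->max_args >= 0 && subr->max_args <= 2);
  assert (nargs <= subr->max_args);
  for (i = 0; i < nargs; i++)
    padded[i] = args[i];                  /* omitted optionals stay nil */

  switch (subr->max_args)
    {
    case 0:
      return subr->fn.a0 ();
    case 1:
      return subr->fn.a1 (padded[0]);
    default:
      return subr->fn.a2 (padded[0], padded[1]);
    }
}

/* Two example subrs used to exercise the dispatcher. */
toy_obj
toy_add (toy_obj a, toy_obj b)
{
  return a + b;
}

toy_obj
toy_sum (int nargs, const toy_obj *args)
{
  toy_obj total = 0;
  int i;

  for (i = 0; i < nargs; i++)
    total += args[i];
  return total;
}
```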
8669
8670 If the function is a @code{Lisp_Compiled_Function},
8671 @code{funcall_compiled_function()} is called. If the function is a
8672 lambda list, @code{funcall_lambda()} is called. If the function is a
8673 macro, [..... fill in] is done. If the function is an autoload,
8674 @code{do_autoload()} is called to load the definition and then eval
8675 starts over [explain this more].
8676
8677 When @code{Feval()} exits, the evaluation depth is reduced by one, the
8678 debugger is called if appropriate, and the current backtrace structure
8679 is removed from the list.
8680
8681 Both @code{funcall_compiled_function()} and @code{funcall_lambda()} need
8682 to go through the list of formal parameters to the function and bind
8683 them to the actual arguments, checking for @code{&rest} and
8684 @code{&optional} symbols in the formal parameters and making sure the
8685 number of actual arguments is correct.
8686 @code{funcall_compiled_function()} can do this a little more
8687 efficiently, since the formal parameter list can be checked for sanity
8688 when the compiled function object is created.
8689
8690 @code{funcall_lambda()} simply calls @code{Fprogn} to execute the code
8691 in the lambda list.
8692
8693 @code{funcall_compiled_function()} calls the real byte-code interpreter
8694 @code{execute_optimized_program()} on the byte-code instructions, which
8695 are converted into an internal form for faster execution.
8696
8697 When a compiled function is executed for the first time by
8698 @code{funcall_compiled_function()}, or during the dump phase of building
8699 XEmacs, the byte-code instructions are converted from a
8700 @code{Lisp_String} (which is inefficient to access, especially in the
8701 presence of MULE) into a @code{Lisp_Opaque} object containing an array
8702 of unsigned char, which can be directly executed by the byte-code
8703 interpreter. At this time the byte code is also analyzed for validity
8704 and transformed into a more optimized form, so that
8705 @code{execute_optimized_program()} can really fly.
8706
8707 Here are some of the optimizations performed by the internal byte-code
8708 transformer:
8709 @enumerate
8710 @item
8711 References to the @code{constants} array are checked for out-of-range
8712 indices, so that the byte interpreter doesn't have to.
8713 @item
8714 References to the @code{constants} array that will be used as a Lisp
8715 variable are checked for being correct non-constant (i.e. not @code{t},
8716 @code{nil}, or @code{keywordp}) symbols, so that the byte interpreter
8717 doesn't have to.
8718 @item
8719 The maximum number of variable bindings in the byte-code is
8720 pre-computed, so that space on the @code{specpdl} stack can be
8721 pre-reserved once for the whole function execution.
8722 @item
8723 All byte-code jumps are relative to the current program counter instead
8724 of the start of the program, thereby saving a register.
8725 @item
8726 One-byte relative jumps are converted from the byte-code form of unsigned
8727 chars offset by 127 to machine-friendly signed chars.
8728 @end enumerate
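
The last item can be sketched as follows. This is only an illustration of the arithmetic: the helper name @code{decode_short_jump} is hypothetical and is not taken from the XEmacs sources, and rejecting the operand 255 is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only: convert a one-byte
   relative jump operand from its byte-code form (an unsigned char
   offset by 127) into a signed displacement.  */
static signed char
decode_short_jump (unsigned char encoded)
{
  int displacement = (int) encoded - 127;

  /* An operand of 255 would map to 128, which does not fit in an
     8-bit signed char; this sketch simply refuses it.  */
  assert (displacement <= 127);
  return (signed char) displacement;
}
```

With this encoding an operand of 127 means ``jump by zero'', and smaller operands are backward jumps.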
8729
8730 Of course, this transformation of the @code{instructions} should not be
8731 visible to the user, so @code{Fcompiled_function_instructions()} needs
8732 to know how to convert the optimized opaque object back into a Lisp
8733 string that is identical to the original string from the @file{.elc}
8734 file. (Actually, the resulting string may (rarely) contain slightly
8735 different, yet equivalent, byte code.)
8736
8737 @code{Ffuncall()} implements Lisp @code{funcall}. @code{(funcall fun
8738 x1 x2 x3 ...)} is equivalent to @code{(eval (list fun (quote x1) (quote
8739 x2) (quote x3) ...))}. @code{Ffuncall()} contains its own code to do
8740 the evaluation, however, and is very similar to @code{Feval()}.
8741
8742 From the performance point of view, it is worth knowing that most of the
8743 time in Lisp evaluation is spent executing @code{Lisp_Subr} and
8744 @code{Lisp_Compiled_Function} objects via @code{Ffuncall()} (not
8745 @code{Feval()}).
8746
8747 @code{Fapply()} implements Lisp @code{apply}, which is very similar to
8748 @code{funcall} except that if the last argument is a list, the result is the
8749 same as if each of the arguments in the list had been passed separately.
8750 @code{Fapply()} does some business to expand the last argument if it's a
8751 list, then calls @code{Ffuncall()} to do the work.
8752
8753 @code{apply1()}, @code{call0()}, @code{call1()}, @code{call2()}, and
8754 @code{call3()} call a function, passing it the argument(s) given (the
8755 arguments are given as separate C arguments rather than being passed as
8756 an array). @code{apply1()} uses @code{Fapply()} while the others use
8757 @code{Ffuncall()} to do the real work.
8758
8759 @node Dynamic Binding; The specbinding Stack; Unwind-Protects, Simple Special Forms, Evaluation, Evaluation; Stack Frames; Bindings
8760 @section Dynamic Binding; The specbinding Stack; Unwind-Protects
8761 @cindex dynamic binding; the specbinding stack; unwind-protects
8762 @cindex binding; the specbinding stack; unwind-protects, dynamic
8763 @cindex specbinding stack; unwind-protects, dynamic binding; the
8764 @cindex unwind-protects, dynamic binding; the specbinding stack;
8765
8766 @example
8767 struct specbinding
8768 @{
8769 Lisp_Object symbol;
8770 Lisp_Object old_value;
8771 Lisp_Object (*func) (Lisp_Object); /* for unwind-protect */
8772 @};
8773 @end example
8774
8775 @code{struct specbinding} is used for local-variable bindings and
8776 unwind-protects. @code{specpdl} holds an array of @code{struct specbinding}'s,
8777 @code{specpdl_ptr} points to the beginning of the free bindings in the
8778 array, @code{specpdl_size} specifies the total number of binding slots
8779 in the array, and @code{max_specpdl_size} specifies the maximum number
8780 of bindings the array can be expanded to hold. @code{grow_specpdl()}
8781 increases the size of the @code{specpdl} array, multiplying its size by
8782 2 but never exceeding @code{max_specpdl_size} (except that if this
8783 number is less than 400, it is first set to 400).
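
The sizing policy just described can be modelled in a few lines. This is a sketch of the arithmetic only, not the real @code{grow_specpdl()}: the function name @code{next_specpdl_size} is hypothetical, and returning 0 when the limit has been reached is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the growth policy.  Returns the new number of slots,
   or 0 if the array is already as large as it may get.  */
static size_t
next_specpdl_size (size_t specpdl_size, size_t *max_specpdl_size)
{
  size_t doubled;

  assert (specpdl_size > 0);

  /* The limit itself is first raised to 400 if it is smaller.  */
  if (*max_specpdl_size < 400)
    *max_specpdl_size = 400;

  if (specpdl_size >= *max_specpdl_size)
    return 0;                   /* cannot grow any further */

  /* Double the size, but never exceed the limit.  */
  doubled = specpdl_size * 2;
  return doubled > *max_specpdl_size ? *max_specpdl_size : doubled;
}
```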
8784
8785 @code{specbind()} binds a symbol to a value and is used for local
8786 variables and @code{let} forms. The symbol and its old value (which
8787 might be @code{Qunbound}, indicating no prior value) are recorded in the
8788 @code{specpdl} array, and @code{specpdl_ptr} is advanced by one slot.
8789
8790 @code{record_unwind_protect()} implements an @dfn{unwind-protect},
8791 which, when placed around a section of code, ensures that some specified
8792 cleanup routine will be executed even if the code exits abnormally
8793 (e.g. through a @code{throw} or quit). @code{record_unwind_protect()}
8794 simply adds a new specbinding to the @code{specpdl} array and stores the
8795 appropriate information in it. The cleanup routine can either be a C
8796 function, which is stored in the @code{func} field, or a @code{progn}
8797 form, which is stored in the @code{old_value} field.
8798
8799 @code{unbind_to()} removes specbindings from the @code{specpdl} array
8800 until the specified position is reached. Each specbinding can be one of
8801 three types:
8802
8803 @enumerate
8804 @item
8805 an unwind-protect with a C cleanup function (@code{func} is not 0, and
8806 @code{old_value} holds an argument to be passed to the function);
8807 @item
8808 an unwind-protect with a Lisp form (@code{func} is 0, @code{symbol}
8809 is @code{nil}, and @code{old_value} holds the form to be executed with
8810 @code{Fprogn()}); or
8811 @item
8812 a local-variable binding (@code{func} is 0, @code{symbol} is not
8813 @code{nil}, and @code{old_value} holds the old value, which is stored as
8814 the symbol's value).
8815 @end enumerate
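
The interplay of @code{specbind()}, @code{record_unwind_protect()} and @code{unbind_to()} can be sketched with a toy stack. Everything prefixed @code{toy_} is hypothetical: @code{Lisp_Object} is replaced by a plain @code{long}, a symbol is reduced to a struct holding only a value, the array has a fixed size, and the Lisp-form kind of unwind-protect is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef long Lisp_Object;       /* stand-in for the real tagged type */

struct toy_symbol { Lisp_Object value; };

struct toy_specbinding
{
  struct toy_symbol *symbol;            /* NULL for unwind-protects */
  Lisp_Object old_value;
  Lisp_Object (*func) (Lisp_Object);    /* for unwind-protect */
};

#define TOY_SPECPDL_SIZE 16
static struct toy_specbinding toy_specpdl[TOY_SPECPDL_SIZE];
static struct toy_specbinding *toy_specpdl_ptr = toy_specpdl;

static int
toy_specpdl_depth (void)
{
  return (int) (toy_specpdl_ptr - toy_specpdl);
}

/* Bind SYM to VALUE, remembering the old value for toy_unbind_to().  */
static void
toy_specbind (struct toy_symbol *sym, Lisp_Object value)
{
  assert (toy_specpdl_depth () < TOY_SPECPDL_SIZE);
  toy_specpdl_ptr->symbol = sym;
  toy_specpdl_ptr->old_value = sym->value;
  toy_specpdl_ptr->func = NULL;
  toy_specpdl_ptr++;
  sym->value = value;
}

/* Arrange for FUNC to be called with ARG when the stack is unwound.  */
static void
toy_record_unwind_protect (Lisp_Object (*func) (Lisp_Object),
                           Lisp_Object arg)
{
  assert (toy_specpdl_depth () < TOY_SPECPDL_SIZE);
  toy_specpdl_ptr->symbol = NULL;
  toy_specpdl_ptr->old_value = arg;
  toy_specpdl_ptr->func = func;
  toy_specpdl_ptr++;
}

/* Pop entries, newest first, until the stack is back at depth COUNT.  */
static void
toy_unbind_to (int count)
{
  while (toy_specpdl_depth () > count)
    {
      --toy_specpdl_ptr;
      if (toy_specpdl_ptr->func != NULL)
        /* Unwind-protect with a C cleanup function.  */
        toy_specpdl_ptr->func (toy_specpdl_ptr->old_value);
      else if (toy_specpdl_ptr->symbol != NULL)
        /* Local-variable binding: restore the old value.  */
        toy_specpdl_ptr->symbol->value = toy_specpdl_ptr->old_value;
    }
}

/* Example cleanup function: it just accumulates its arguments.  */
static Lisp_Object toy_cleanup_total;

static Lisp_Object
toy_cleanup (Lisp_Object arg)
{
  toy_cleanup_total += arg;
  return arg;
}
```

A caller records the current depth, makes its bindings, runs its body, and finally calls @code{toy_unbind_to()} with the recorded depth; nested bindings of the same symbol are then undone in reverse order.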
8816
8817 @node Simple Special Forms, Catch and Throw, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings
8818 @section Simple Special Forms
8819 @cindex special forms, simple
8820
8821 @code{or}, @code{and}, @code{if}, @code{cond}, @code{progn},
8822 @code{prog1}, @code{prog2}, @code{setq}, @code{quote}, @code{function},
8823 @code{let*}, @code{let}, @code{while}
8824
8825 All of these are very simple and work as expected, calling
8826 @code{Feval()} or @code{Fprogn()} as necessary and (in the case of
8827 @code{let} and @code{let*}) using @code{specbind()} to create bindings
8828 and @code{unbind_to()} to undo the bindings when finished.
8829
8830 Note that, with the exception of @code{Fprogn}, these functions are
8831 typically called in real life only in interpreted code, since the byte
8832 compiler knows how to convert calls to these functions directly into
8833 byte code.
8834
8835 @node Catch and Throw, , Simple Special Forms, Evaluation; Stack Frames; Bindings
8836 @section Catch and Throw
8837 @cindex catch and throw
8838 @cindex throw, catch and
8839
8840 @example
8841 struct catchtag
8842 @{
8843 Lisp_Object tag;
8844 Lisp_Object val;
8845 struct catchtag *next;
8846 struct gcpro *gcpro;
8847 jmp_buf jmp;
8848 struct backtrace *backlist;
8849 int lisp_eval_depth;
8850 int pdlcount;
8851 @};
8852 @end example
8853
8854 @code{catch} is a Lisp function that places a catch around a body of
8855 code. A catch is a means of non-local exit from the code. When a catch
8856 is created, a tag is specified, and executing a @code{throw} to this tag
8857 will exit from the body of code caught with this tag, and the value of
8858 the @code{catch} form will be the value given in the call to @code{throw}.
8859 If there is no such call, the code will be executed normally.
8860
8861 Information pertaining to a catch is held in a @code{struct catchtag},
8862 which is placed at the head of a linked list pointed to by
8863 @code{catchlist}. @code{internal_catch()} is passed a C function to
8864 call (@code{Fprogn()} when Lisp @code{catch} is called) and arguments to
8865 give it, and places a catch around the function. Each @code{struct
8866 catchtag} is held in the stack frame of the @code{internal_catch()}
8867 instance that created the catch.
8868
8869 @code{internal_catch()} is fairly straightforward. It stores into the
8870 @code{struct catchtag} the tag name and the current values of
8871 @code{backtrace_list}, @code{lisp_eval_depth}, @code{gcprolist}, and the
8872 offset into the @code{specpdl} array, sets a jump point with @code{_setjmp()}
8873 (storing the jump point into the @code{struct catchtag}), and calls the
8874 function. Control will return to @code{internal_catch()} either when
8875 the function exits normally or through a @code{_longjmp()} to this jump
8876 point. In the latter case, @code{throw} will store the value to be
8877 returned into the @code{struct catchtag} before jumping. When it's
8878 done, @code{internal_catch()} removes the @code{struct catchtag} from
8879 the catchlist and returns the proper value.
8880
8881 @code{Fthrow()} goes up through the catchlist until it finds one with
8882 a matching tag. It then calls @code{unbind_catch()} to restore
8883 everything to what it was when the appropriate catch was set, stores the
8884 return value in the @code{struct catchtag}, and jumps (with
8885 @code{_longjmp()}) to its jump point.
8886
8887 @code{unbind_catch()} removes all catches from the catchlist until it
8888 finds the correct one. Some of the catches might have been placed for
8889 error-trapping, and if so, the appropriate entries on the handlerlist
8890 must be removed (see ``errors''). @code{unbind_catch()} also restores
8891 the values of @code{gcprolist}, @code{backtrace_list}, and
8892 @code{lisp_eval_depth}, and calls @code{unbind_to()} to undo any specbindings
8893 created since the catch.
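
The control flow described in this section can be sketched as below. This is a deliberately reduced model, not the XEmacs code: all @code{toy_} names are hypothetical, tags and values are plain integers, the standard @code{setjmp()} and @code{longjmp()} are used in place of @code{_setjmp()} and @code{_longjmp()}, and the restoration of @code{gcprolist}, @code{backtrace_list}, @code{lisp_eval_depth} and the @code{specpdl} offset is omitted.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

struct toy_catchtag
{
  int tag;
  volatile int val;     /* written by the thrower, read after the jump */
  struct toy_catchtag *next;
  jmp_buf jmp;
};

static struct toy_catchtag *toy_catchlist;

/* Call FUNC (ARG) with a catch for TAG around it.  */
static int
toy_internal_catch (int tag, int (*func) (int), int arg)
{
  struct toy_catchtag c;        /* lives in this stack frame */
  int val;

  c.tag = tag;
  c.val = 0;
  c.next = toy_catchlist;
  toy_catchlist = &c;           /* push onto the head of the list */

  if (setjmp (c.jmp) == 0)
    val = func (arg);           /* FUNC returned normally */
  else
    val = c.val;                /* we got here through toy_throw() */

  toy_catchlist = c.next;       /* pop this catch */
  return val;
}

/* Exit non-locally to the innermost catch whose tag is TAG.  */
static void
toy_throw (int tag, int value)
{
  struct toy_catchtag *c;

  for (c = toy_catchlist; c != NULL; c = c->next)
    if (c->tag == tag)
      {
        toy_catchlist = c;      /* discard catches set inside this one */
        c->val = value;
        longjmp (c->jmp, 1);
      }
  /* No matching catch: this sketch simply returns.  */
}

static int
toy_inner (int arg)
{
  if (arg > 10)
    toy_throw (1, arg * 2);     /* non-local exit to the catch tagged 1 */
  return arg + 1;
}

static int
toy_outer (int arg)
{
  /* A nested catch with a different tag; a throw to tag 1 skips it.  */
  return toy_internal_catch (2, toy_inner, arg) + 100;
}
```

Calling @code{toy_internal_catch (1, toy_outer, 5)} runs to completion normally, whereas an argument above 10 makes @code{toy_inner()} throw past the inner catch, so the addition in @code{toy_outer()} never happens.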
8894
8895
8896 @node Symbols and Variables, Buffers, Evaluation; Stack Frames; Bindings, Top
8897 @chapter Symbols and Variables
8898 @cindex symbols and variables
8899 @cindex variables, symbols and
8900
8901 @menu
8902 * Introduction to Symbols::
8903 * Obarrays::
8904 * Symbol Values::
8905 @end menu
8906
8907 @node Introduction to Symbols, Obarrays, Symbols and Variables, Symbols and Variables
8908 @section Introduction to Symbols
8909 @cindex symbols, introduction to
8910
8911 A symbol is basically just an object with four fields: a name (a
8912 string), a value (some Lisp object), a function (some Lisp object), and
8913 a property list (usually a list of alternating keyword/value pairs).
8914 What makes symbols special is that there is usually only one symbol with
8915 a given name, and the symbol is referred to by name. This makes a
8916 symbol a convenient way of calling up data by name, i.e. of implementing
8917 variables. (The variable's value is stored in the @dfn{value slot}.)
8918 Similarly, functions are referenced by name, and the definition of the
8919 function is stored in a symbol's @dfn{function slot}. This means that
8920 there can be a distinct function and variable with the same name. The
8921 property list is used as a more general mechanism of associating
8922 additional values with particular names, and once again the namespace is
8923 independent of the function and variable namespaces.
8924
8925 @node Obarrays, Symbol Values, Introduction to Symbols, Symbols and Variables
8926 @section Obarrays
8927 @cindex obarrays
8928
8929 The identity of symbols with their names is accomplished through a
8930 structure called an obarray, which is just a poorly-implemented hash
8931 table mapping from strings to symbols whose name is that string. (I say
8932 ``poorly implemented'' because an obarray appears in Lisp as a vector
8933 with some hidden fields rather than as its own opaque type. This is an
8934 Emacs Lisp artifact that should be fixed.)
8935
8936 Obarrays are implemented as a vector of some fixed size (which should
8937 be a prime for best results), where each ``bucket'' of the vector
8938 contains one or more symbols, threaded through a hidden @code{next}
8939 field in the symbol. Lookup of a symbol in an obarray, and adding a
8940 symbol to an obarray, is accomplished through standard hash-table
8941 techniques.
8942
8943 The standard Lisp function for working with symbols and obarrays is
8944 @code{intern}. This looks up a symbol in an obarray given its name; if
8945 it's not found, a new symbol is automatically created with the specified
8946 name, added to the obarray, and returned. This is what happens when the
8947 Lisp reader encounters a symbol (or more precisely, encounters the name
8948 of a symbol) in some text that it is reading. There is a standard
8949 obarray called @code{obarray} that is used for this purpose, although
8950 the Lisp programmer is free to create his own obarrays and @code{intern}
8951 symbols in them.
8952
8953 Note that, once a symbol is in an obarray, it stays there until
8954 something is done about it, and the standard obarray @code{obarray}
8955 always stays around, so once you use any particular variable name, a
8956 corresponding symbol will stay around in @code{obarray} until you exit
8957 XEmacs.
8958
8959 Note that @code{obarray} itself is a variable, and as such there is a
8960 symbol in @code{obarray} whose name is @code{"obarray"} and which
8961 contains @code{obarray} as its value.
8962
8963 Note also that this call to @code{intern} occurs only when in the Lisp
8964 reader, not when the code is executed (at which point the symbol is
8965 already around, stored as such in the definition of the function).
8966
8967 You can create your own obarray using @code{make-vector} (this is
8968 horrible but is an artifact) and intern symbols into that obarray.
8969 Doing that will result in two or more symbols with the same name.
8970 However, at most one of these symbols is in the standard @code{obarray}:
8971 You cannot have two symbols of the same name in any particular obarray.
8972 Note that you cannot add a symbol to an obarray in any fashion other
8973 than using @code{intern}: i.e. you can't take an existing symbol and put
8974 it in an existing obarray. Nor can you change the name of an existing
8975 symbol. (Since obarrays are vectors, you can violate the consistency of
8976 things by storing directly into the vector, but let's ignore that
8977 possibility.)
8978
8979 Usually symbols are created by @code{intern}, but if you really want,
8980 you can explicitly create a symbol using @code{make-symbol}, giving it
8981 some name. The resulting symbol is not in any obarray (i.e. it is
8982 @dfn{uninterned}), and you can't add it to any obarray. Therefore its
8983 primary purpose is as a symbol to use in macros to avoid namespace
8984 pollution. It can also be used as a carrier of information, but cons
8985 cells could probably be used just as well.
8986
8987 You can also use @code{intern-soft} to look up a symbol but not create
8988 a new one, and @code{unintern} to remove a symbol from an obarray. This
8989 returns the removed symbol. (Remember: You can't put the symbol back
8990 into any obarray.) Finally, @code{mapatoms} maps over all of the symbols
8991 in an obarray.
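
The bucket structure and the difference between @code{intern} and @code{intern-soft} can be sketched as follows. This is a toy model under stated assumptions: the names @code{toy_obarray}, @code{toy_hash}, @code{toy_intern} and @code{toy_intern_soft}, the bucket count, and the hash function are all hypothetical, and a symbol is reduced to its name plus the hidden @code{next} field.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_OBARRAY_SIZE 17     /* a small prime */

struct toy_obsym
{
  char *name;
  struct toy_obsym *next;       /* chains the symbols in one bucket */
};

static struct toy_obsym *toy_obarray[TOY_OBARRAY_SIZE];

/* Arbitrary string hash, chosen only for the sake of the example.  */
static unsigned
toy_hash (const char *name)
{
  unsigned h = 0;

  while (*name)
    h = h * 31 + (unsigned char) *name++;
  return h % TOY_OBARRAY_SIZE;
}

/* Like intern-soft: look NAME up, returning NULL if it is absent.  */
static struct toy_obsym *
toy_intern_soft (const char *name)
{
  struct toy_obsym *s;

  for (s = toy_obarray[toy_hash (name)]; s != NULL; s = s->next)
    if (strcmp (s->name, name) == 0)
      return s;
  return NULL;
}

/* Like intern: look NAME up, creating the symbol if it is absent.  */
static struct toy_obsym *
toy_intern (const char *name)
{
  struct toy_obsym *s = toy_intern_soft (name);

  if (s == NULL)
    {
      unsigned h = toy_hash (name);

      s = malloc (sizeof *s);
      assert (s != NULL);
      s->name = malloc (strlen (name) + 1);
      assert (s->name != NULL);
      strcpy (s->name, name);
      s->next = toy_obarray[h]; /* thread onto the bucket's chain */
      toy_obarray[h] = s;
    }
  return s;
}
```

Interning the same name twice yields the same pointer, which is what makes comparing symbols by identity equivalent to comparing their names within one obarray.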
8992
8993 @node Symbol Values, , Obarrays, Symbols and Variables
8994 @section Symbol Values
8995 @cindex symbol values
8996 @cindex values, symbol
8997
8998 The value field of a symbol normally contains a Lisp object. However,
8999 a symbol can be @dfn{unbound}, meaning that it logically has no value.
9000 This is internally indicated by storing a special Lisp object, called
9001 @dfn{the unbound marker} and stored in the global variable
9002 @code{Qunbound}. The unbound marker is of a special Lisp object type
9003 called @dfn{symbol-value-magic}. It is impossible for the Lisp
9004 programmer to directly create or access any object of this type.
9005
9006 @strong{You must not let any ``symbol-value-magic'' object escape to
9007 the Lisp level.} Printing any of these objects will cause the message
9008 @samp{INTERNAL EMACS BUG} to appear as part of the print representation.
9009 (You may see this normally when you call @code{debug_print()} from the
9010 debugger on a Lisp object.) If you let one of these objects escape to
9011 the Lisp level, you will violate a number of assumptions contained in
9012 the C code and make the unbound marker not function right.
9013
9014 When a symbol is created, its value field (and function field) are set
9015 to @code{Qunbound}. The Lisp programmer can restore these conditions
9016 later using @code{makunbound} or @code{fmakunbound}, and can query to
9017 see whether the value or function fields are @dfn{bound} (i.e. have a
9018 value other than @code{Qunbound}) using @code{boundp} and
9019 @code{fboundp}. The fields are set to a normal Lisp object using
9020 @code{set} (or @code{setq}) and @code{fset}.
9021
9022 Other symbol-value-magic objects are used as special markers to
9023 indicate variables that have non-normal properties. This includes any
9024 variables that are tied into C variables (setting the variable magically
9025 sets some global variable in the C code, and likewise for retrieving the
9026 variable's value), variables that magically tie into slots in the
9027 current buffer, variables that are buffer-local, etc. The
9028 symbol-value-magic object is stored in the value cell in place of
9029 a normal object, and the code to retrieve a symbol's value
9030 (i.e. @code{symbol-value}) knows how to do special things with them.
9031 This means that you should not just fetch the value cell directly if you
9032 want a symbol's value.
9033
9034 The exact workings of this are rather complex and involved and are
9035 well-documented in comments in @file{buffer.c}, @file{symbols.c}, and
9036 @file{lisp.h}.
9037
9038 @node Buffers, Text, Symbols and Variables, Top
9039 @chapter Buffers
9040 @cindex buffers
9041
9042 @menu
9043 * Introduction to Buffers:: A buffer holds a block of text such as a file.
9044 * Buffer Lists:: Keeping track of all buffers.
9045 * Markers and Extents:: Tagging locations within a buffer.
9046 * The Buffer Object:: The Lisp object corresponding to a buffer.
9047 @end menu
9048
9049 @node Introduction to Buffers, Buffer Lists, Buffers, Buffers
9050 @section Introduction to Buffers
9051 @cindex buffers, introduction to
9052
9053 A buffer is logically just a Lisp object that holds some text.
9054 In this, it is like a string, but a buffer is optimized for
9055 frequent insertion and deletion, while a string is not. Furthermore:
9056
9057 @enumerate
9058 @item
9059 Buffers are @dfn{permanent} objects, i.e. once you create them, they
9060 remain around, and need to be explicitly deleted before they go away.
9061 @item
9062 Each buffer has a unique name, which is a string. Buffers are
9063 normally referred to by name. In this respect, they are like
9064 symbols.
9065 @item
9066 Buffers have a default insertion position, called @dfn{point}.
9067 Inserting text (unless you explicitly give a position) goes at point,
9068 and moves point forward past the text. This is what is going on when
9069 you type text into Emacs.
9070 @item
9071 Buffers have lots of extra properties associated with them.
9072 @item
9073 Buffers can be @dfn{displayed}. What this means is that there
9074 exist a number of @dfn{windows}, which are objects that correspond
9075 to some visible section of your display, and each window has
9076 an associated buffer, and the current contents of the buffer
9077 are shown in that section of the display. The redisplay mechanism
9078 (which takes care of doing this) knows how to look at the
9079 text of a buffer and come up with some reasonable way of displaying
9080 this. Many of the properties of a buffer control how the
9081 buffer's text is displayed.
9082 @item
9083 One buffer is distinguished and called the @dfn{current buffer}. It is
9084 stored in the variable @code{current_buffer}. Buffer operations operate
9085 on this buffer by default. When you are typing text into a buffer, the
9086 buffer you are typing into is always @code{current_buffer}. Switching
9087 to a different window changes the current buffer. Note that Lisp code
9088 can temporarily change the current buffer using @code{set-buffer} (often
9089 enclosed in a @code{save-excursion} so that the former current buffer
9090 gets restored when the code is finished). However, calling
9091 @code{set-buffer} will NOT cause a permanent change in the current
9092 buffer. The reason for this is that the top-level event loop sets
9093 @code{current_buffer} to the buffer of the selected window, each time
9094 it finishes executing a user command.
9095 @end enumerate
9096
9097 Make sure you understand the distinction between @dfn{current buffer}
9098 and @dfn{buffer of the selected window}, and the distinction between
9099 @dfn{point} of the current buffer and @dfn{window-point} of the selected
9100 window. (This latter distinction is explained in detail in the section
9101 on windows.)
9102
9103 @node Buffer Lists, Markers and Extents, Introduction to Buffers, Buffers
9104 @section Buffer Lists
9105 @cindex buffer lists
9106
9107 Recall from earlier that buffers are @dfn{permanent} objects, i.e. that
9108 they remain around until explicitly deleted. This entails that there is
9109 a list of all the buffers in existence. This list is actually an
9110 assoc-list (mapping from the buffer's name to the buffer) and is stored
9111 in the global variable @code{Vbuffer_alist}.
9112
9113 The order of the buffers in the list is important: the buffers are
9114 ordered approximately from most-recently-used to least-recently-used.
9115 Switching to a buffer using @code{switch-to-buffer},
9116 @code{pop-to-buffer}, etc. and switching windows using
9117 @code{other-window}, etc. usually brings the new current buffer to the
9118 front of the list. @code{switch-to-buffer}, @code{other-buffer},
9119 etc. look at the beginning of the list to find an alternative buffer to
9120 suggest. You can also explicitly move a buffer to the end of the list
9121 using @code{bury-buffer}.
9122
9123 In addition to the global ordering in @code{Vbuffer_alist}, each frame
9124 has its own ordering of the list. These lists always contain the same
9125 elements as in @code{Vbuffer_alist} although possibly in a different
9126 order. @code{buffer-list} normally returns the list for the selected
9127 frame. This allows you to work in separate frames without things
9128 interfering with each other.
9129
9130 The standard way to look up a buffer given a name is
9131 @code{get-buffer}, and the standard way to create a new buffer is
9132 @code{get-buffer-create}, which looks up a buffer with a given name,
9133 creating a new one if necessary. These operations correspond exactly
9134 to the symbol operations @code{intern-soft} and @code{intern},
9135 respectively. You can also force a new buffer to be created using
9136 @code{generate-new-buffer}, which takes a name and (if necessary) makes
9137 a unique name from this by appending a number, and then creates the
9138 buffer. This is basically like the symbol operation @code{gensym}.
9139
9140 @node Markers and Extents, The Buffer Object, Buffer Lists, Buffers
9141 @section Markers and Extents
9142 @cindex markers and extents
9143 @cindex extents, markers and
9144
9145 Among the things associated with a buffer are things that are
9146 logically attached to certain buffer positions. This can be used to
9147 keep track of a buffer position when text is inserted and deleted, so
9148 that it remains at the same spot relative to the text around it; to
9149 assign properties to particular sections of text; etc. There are two
9150 such objects that are useful in this regard: they are @dfn{markers} and
9151 @dfn{extents}.
9152
9153 A @dfn{marker} is simply a flag placed at a particular buffer
9154 position, which is moved around as text is inserted and deleted.
9155 Markers are used for all sorts of purposes, such as the @code{mark} that
9156 is the other end of textual regions to be cut, copied, etc.
9157
9158 An @dfn{extent} is similar to two markers plus some associated
9159 properties, and is used to keep track of regions in a buffer as text is
9160 inserted and deleted, and to add properties (e.g. fonts) to particular
9161 regions of text. The external interface of extents is explained
9162 elsewhere.
9163
9164 The important thing here is that markers and extents simply contain
9165 buffer positions in them as integers, and every time text is inserted or
9166 deleted, these positions must be updated. In order to minimize the
9167 amount of shuffling that needs to be done, the positions in markers and
9168 extents (there's one per marker, two per extent) are stored in Membpos's.
9169 This means that they only need to be moved when the text is physically
9170 moved in memory; since the gap structure tries to minimize this, it also
9171 minimizes the number of marker and extent indices that need to be
9172 adjusted. Look in @file{insdel.c} for the details of how this works.
9173
9174 One other important distinction is that markers are @dfn{temporary}
9175 while extents are @dfn{permanent}. This means that markers disappear as
9176 soon as there are no more pointers to them, and correspondingly, there
9177 is no way to determine what markers are in a buffer if you are just
9178 given the buffer. Extents remain in a buffer until they are detached
9179 (which could happen as a result of text being deleted) or the buffer is
9180 deleted, and primitives do exist to enumerate the extents in a buffer.
9181
9182 @node The Buffer Object, , Markers and Extents, Buffers
9183 @section The Buffer Object
9184 @cindex buffer object, the
9185 @cindex object, the buffer
9186
9187 Buffers contain fields not directly accessible by the Lisp programmer.
9188 We describe them here, naming them by the names used in the C code.
9189 Many are accessible indirectly in Lisp programs via Lisp primitives.
9190
9191 @table @code
9192 @item name
9193 The buffer name is a string that names the buffer. It is guaranteed to
9194 be unique. @xref{Buffer Names,,, lispref, XEmacs Lisp Reference
9195 Manual}.
9196
9197 @item save_modified
9198 This field contains the time when the buffer was last saved, as an
9199 integer. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference
9200 Manual}.
9201
9202 @item modtime
9203 This field contains the modification time of the visited file. It is
9204 set when the file is written or read. Every time the buffer is written
9205 to the file, this field is compared to the modification time of the
9206 file. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference
9207 Manual}.
9208
9209 @item auto_save_modified
9210 This field contains the time when the buffer was last auto-saved.
9211
9212 @item last_window_start
9213 This field contains the @code{window-start} position in the buffer as of
9214 the last time the buffer was displayed in a window.
9215
9216 @item undo_list
9217 This field points to the buffer's undo list. @xref{Undo,,, lispref,
9218 XEmacs Lisp Reference Manual}.
9219
9220 @item syntax_table_v
9221 This field contains the syntax table for the buffer. @xref{Syntax
9222 Tables,,, lispref, XEmacs Lisp Reference Manual}.
9223
9224 @item downcase_table
9225 This field contains the conversion table for converting text to lower
9226 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}.
9227
9228 @item upcase_table
9229 This field contains the conversion table for converting text to upper
9230 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}.
9231
9232 @item case_canon_table
9233 This field contains the conversion table for canonicalizing text for
9234 case-folding search. @xref{Case Tables,,, lispref, XEmacs Lisp
9235 Reference Manual}.
9236
9237 @item case_eqv_table
9238 This field contains the equivalence table for case-folding search.
9239 @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}.
9240
9241 @item display_table
9242 This field contains the buffer's display table, or @code{nil} if it
9243 doesn't have one. @xref{Display Tables,,, lispref, XEmacs Lisp
9244 Reference Manual}.
9245
9246 @item markers
9247 This field contains the chain of all markers that currently point into
9248 the buffer. Deletion of text in the buffer, and motion of the buffer's
9249 gap, must check each of these markers and perhaps update it.
9250 @xref{Markers,,, lispref, XEmacs Lisp Reference Manual}.
9251
9252 @item backed_up
9253 This field is a flag that tells whether a backup file has been made for
9254 the visited file of this buffer.
9255
9256 @item mark
9257 This field contains the mark for the buffer. The mark is a marker,
9258 hence it is also included on the list @code{markers}. @xref{The Mark,,,
9259 lispref, XEmacs Lisp Reference Manual}.
9260
9261 @item mark_active
9262 This field is non-@code{nil} if the buffer's mark is active.
9263
9264 @item local_var_alist
9265 This field contains the association list describing the variables local
9266 in this buffer, and their values, with the exception of local variables
9267 that have special slots in the buffer object. (Those slots are omitted
9268 from this table.) @xref{Buffer-Local Variables,,, lispref, XEmacs Lisp
9269 Reference Manual}.
9270
9271 @item modeline_format
9272 This field contains a Lisp object which controls how to display the mode
9273 line for this buffer. @xref{Modeline Format,,, lispref, XEmacs Lisp
9274 Reference Manual}.
9275
9276 @item base_buffer
9277 This field holds the buffer's base buffer (if it is an indirect buffer),
9278 or @code{nil}.
9279 @end table
9280
9281 @node Text, Multilingual Support, Buffers, Top
9282 @chapter Text
9283 @cindex text
9284
9285 @menu
9286 * The Text in a Buffer:: Representation of the text in a buffer.
9287 * Ibytes and Ichars:: Representation of individual characters.
9288 * Byte-Char Position Conversion::
9289 * Searching and Matching:: Higher-level algorithms.
9290 @end menu
9291
9292 @node The Text in a Buffer, Ibytes and Ichars, Text, Text
9293 @section The Text in a Buffer
9294 @cindex text in a buffer, the
9295 @cindex buffer, the text in a
9296
9297 The text in a buffer consists of a sequence of zero or more
9298 characters. A @dfn{character} is an integer that logically represents
9299 a letter, number, space, or other unit of text. Most of the characters
9300 that you will typically encounter belong to the ASCII set of characters,
9301 but there are also characters for various sorts of accented letters,
9302 special symbols, Chinese and Japanese ideograms (e.g. Kanji, Katakana,
9303 etc.), Cyrillic and Greek letters, etc. The actual number of possible
9304 characters is quite large.
9305
9306 For now, we can view a character as some non-negative integer that
9307 has some shape that defines how it typically appears (e.g. as an
9308 uppercase A). (The exact way in which a character appears depends on the
9309 font used to display the character.) The internal type of characters in
9310 the C code is an @code{Ichar}; this is just an @code{int}, but using a
9311 symbolic type makes the code clearer.
9312
9313 Between every character in a buffer is a @dfn{buffer position} or
9314 @dfn{character position}. We can speak of the character before or after
9315 a particular buffer position, and when you insert a character at a
9316 particular position, all characters after that position end up at new
9317 positions. When we speak of the character @dfn{at} a position, we
9318 really mean the character after the position. (This ambiguity
9319 between a buffer position being ``between'' two characters and ``on'' a
9320 character is rampant in Emacs.)
9321
9322 Buffer positions are numbered starting at 1. This means that
9323 position 1 is before the first character, and position 0 is not
9324 valid. If there are N characters in a buffer, then buffer
9325 position N+1 is after the last one, and position N+2 is not valid.
9326
9327 The internal makeup of the Ichar integer varies depending on whether
9328 we have compiled with MULE support. If not, the Ichar integer is an
9329 8-bit integer with possible values from 0 - 255. 0 - 127 are the
9330 standard ASCII characters, while 128 - 255 are the characters from the
9331 ISO-8859-1 character set. If we have compiled with MULE support, an
9332 Ichar is a 19-bit integer, with the various bits having meanings
9333 according to a complex scheme that will be detailed later. The
9334 characters numbered 0 - 255 still have the same meanings as for the
9335 non-MULE case, though.
9336
9337 Internally, the text in a buffer is represented in a fairly simple
9338 fashion: as a contiguous array of bytes, with a @dfn{gap} of some size
9339 in the middle. Although the gap is of some substantial size in bytes,
9340 there is no text contained within it: From the perspective of the text
9341 in the buffer, it does not exist. The gap logically sits at some buffer
9342 position, between two characters (or possibly at the beginning or end of
9343 the buffer). Insertion of text in a buffer at a particular position is
9344 always accomplished by first moving the gap to that position
9345 (i.e. through some block moving of text), then writing the text into the
9346 beginning of the gap, thereby shrinking the gap. If the gap shrinks
9347 down to nothing, a new gap is created. (What actually happens is that a
9348 new gap is ``created'' at the end of the buffer's text, which requires
9349 nothing more than changing a couple of indices; then the gap is
9350 ``moved'' to the position where the insertion needs to take place by
9351 moving up in memory all the text after that position.) Similarly,
9352 deletion occurs by moving the gap to the place where the text is to be
9353 deleted, and then simply expanding the gap to include the deleted text.
9354 (@dfn{Expanding} and @dfn{shrinking} the gap as just described means
9355 just that the internal indices that keep track of where the gap is
9356 located are changed.)
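
The gap mechanics described above can be sketched in a few lines of C.
This is only an illustration under simplifying assumptions: offsets are
0-based, the allocation never grows, and the names @code{struct gap_buffer},
@code{move_gap}, @code{gap_insert}, @code{gap_delete}, @code{gap_length}
and @code{gap_byte_at} are hypothetical and do not come from the XEmacs
source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical gap buffer: TEXT holds SIZE bytes in total, and the gap
   occupies the half-open memory range [gap_start, gap_end). */
struct gap_buffer
{
  unsigned char *text;
  size_t size;
  size_t gap_start;
  size_t gap_end;
};

/* Number of bytes of real text, i.e. everything except the gap. */
size_t
gap_length (const struct gap_buffer *gb)
{
  return gb->size - (gb->gap_end - gb->gap_start);
}

/* Fetch the byte at 0-based logical offset POS, skipping over the gap. */
unsigned char
gap_byte_at (const struct gap_buffer *gb, size_t pos)
{
  assert (pos < gap_length (gb));
  return gb->text[pos < gb->gap_start
                  ? pos : pos + (gb->gap_end - gb->gap_start)];
}

/* Move the gap so that it begins at logical offset POS. */
void
move_gap (struct gap_buffer *gb, size_t pos)
{
  size_t gap_len = gb->gap_end - gb->gap_start;

  assert (pos <= gb->size - gap_len);
  if (pos < gb->gap_start)
    /* Slide the text between POS and the gap up to the far side. */
    memmove (gb->text + pos + gap_len, gb->text + pos, gb->gap_start - pos);
  else
    /* Slide the text between the gap and POS down into the gap. */
    memmove (gb->text + gb->gap_start, gb->text + gb->gap_end,
             pos - gb->gap_start);
  gb->gap_start = pos;
  gb->gap_end = pos + gap_len;
}

/* Insert LEN bytes at POS.  Returns 0 on success, or -1 if the gap is
   too small (a real implementation would make a new gap instead). */
int
gap_insert (struct gap_buffer *gb, size_t pos, const void *src, size_t len)
{
  if (len > gb->gap_end - gb->gap_start)
    return -1;
  move_gap (gb, pos);
  memcpy (gb->text + gb->gap_start, src, len);
  gb->gap_start += len;         /* the gap shrinks from the front */
  return 0;
}

/* Delete LEN bytes starting at POS by letting the gap swallow them. */
void
gap_delete (struct gap_buffer *gb, size_t pos, size_t len)
{
  move_gap (gb, pos);
  assert (len <= gb->size - gb->gap_end);
  gb->gap_end += len;           /* only an index changes; no bytes move */
}
```

Note that @code{gap_delete} moves no text once the gap is in place, which
is why repeated insertions and deletions near the same position are cheap.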
9357
9358 Note that the total amount of memory allocated for a buffer's text never
9359 decreases while the buffer is live. Therefore, if you load up a
9360 20-megabyte file and then delete all but one character, there will be a
9361 20-megabyte gap, which won't get any smaller (except by inserting
9362 characters back again). Once the buffer is killed, the memory allocated
9363 for the buffer text will be freed, but it will still be sitting on the
9364 heap, taking up virtual memory, and will not be released back to the
9365 operating system. (However, if you have compiled XEmacs with rel-alloc,
9366 the situation is different: in this case, the space @emph{will} be
9367 released back to the operating system, although this tends to result in a
9368 noticeable speed penalty.)
9369
9370 Astute readers may notice that the text in a buffer is represented as
9371 an array of @emph{bytes}, while (at least in the MULE case) an Ichar is
9372 a 19-bit integer, which clearly cannot fit in a byte. This means (of
9373 course) that the text in a buffer uses a different representation from
9374 an Ichar: specifically, the 19-bit Ichar becomes a series of one to
9375 four bytes. The conversion between these two representations is complex
9376 and will be described later.
9377
9378 In the non-MULE case, everything is very simple: An Ichar
9379 is an 8-bit value, which fits neatly into one byte.
9380
9381 If we are given a buffer position and want to retrieve the
9382 character at that position, we need to follow these steps:
9383
9384 @enumerate
9385 @item
9386 Pretend there's no gap, and convert the buffer position into a @dfn{byte
9387 index} that indexes to the appropriate byte in the buffer's stream of
9388 textual bytes. By convention, byte indices begin at 1, just like buffer
9389 positions. In the non-MULE case, byte indices and buffer positions are
9390 identical, since one character equals one byte.
9391 @item
9392 Convert the byte index into a @dfn{memory index}, which takes the gap
9393 into account. The memory index is a direct index into the block of
9394 memory that stores the text of a buffer. This basically just involves
9395 checking to see if the byte index is past the gap, and if so, adding the
9396 size of the gap to it. By convention, memory indices begin at 1, just
9397 like buffer positions and byte indices, and when referring to the
9398 position that is @dfn{at} the gap, we always use the memory position at
9399 the @emph{beginning}, not at the end, of the gap.
9400 @item
9401 Fetch the appropriate bytes at the determined memory position.
9402 @item
9403 Convert these bytes into an Ichar.
9404 @end enumerate
9405
9406 In the non-MULE case, steps (3) and (4) boil down to a simple one-byte
9407 memory access.
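
As a rough sketch of step (2), and of steps (2) through (4) in the
non-MULE case, consider the following C fragment.  The typedef and
function names are hypothetical stand-ins chosen for this example, and
all indices are 1-based as in the text.  Note the asymmetry: a
@emph{position} at the gap maps to the beginning of the gap, whereas the
@emph{byte after} that position lives beyond the end of the gap.

```c
#include <assert.h>

/* Hypothetical stand-ins for the typedefs named in the text. */
typedef int Bytebpos_sketch;
typedef int Membpos_sketch;

/* Step 2 for a position: positions past the gap are shifted by the gap
   size; the position at the gap maps to the beginning of the gap. */
Membpos_sketch
byte_to_mem_position (Bytebpos_sketch pos, Bytebpos_sketch gap_pos,
                      int gap_size)
{
  assert (pos >= 1 && gap_size >= 0);
  return pos > gap_pos ? pos + gap_size : pos;
}

/* Steps 2-4 in the non-MULE case: fetch the byte after position POS.
   TEXT points at the byte with memory index 1.  The byte after the
   position at the gap is stored just past the gap, hence ">=". */
unsigned char
byte_after_position (const unsigned char *text, Bytebpos_sketch pos,
                     Bytebpos_sketch gap_pos, int gap_size)
{
  Membpos_sketch mem;

  assert (pos >= 1 && gap_size >= 0);
  mem = pos >= gap_pos ? pos + gap_size : pos;
  return text[mem - 1];
}
```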
9408
9409 Note that we have defined three types of positions in a buffer:
9410
9411 @enumerate
9412 @item
9413 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos}
9414 @item
9415 @dfn{byte indices}, typedef @code{Bytebpos}
9416 @item
9417 @dfn{memory indices}, typedef @code{Membpos}
9418 @end enumerate
9419
9420 All three typedefs are just @code{int}s, but defining them this way makes
9421 things a lot clearer.
9422
9423 Most code works with buffer positions. In particular, all Lisp code
9424 that refers to text in a buffer uses buffer positions. Lisp code does
9425 not know that byte indices or memory indices exist.
9426
9427 Finally, we have a typedef for the bytes in a buffer. This is an
9428 @code{Ibyte}, which is an @code{unsigned char}. Referring to them as
9429 Ibytes underscores the fact that we are working with a string of bytes
9430 in the internal Emacs buffer representation rather than in one of a
9431 number of possible alternative representations (e.g. EUC-encoded text,
9432 etc.).
9433
9434 @node Ibytes and Ichars, Byte-Char Position Conversion, The Text in a Buffer, Text
9435 @section Ibytes and Ichars
9436 @cindex Ibytes and Ichars
9437 @cindex Ichars, Ibytes and
9438
9439 Not yet documented.
9440
9441 @node Byte-Char Position Conversion, Searching and Matching, Ibytes and Ichars, Text
9442 @section Byte-Char Position Conversion
9443 @cindex byte-char position conversion
9444 @cindex position conversion, byte-char
9445 @cindex conversion, byte-char position
9446
9447 Oct 2004:
9448
9449 This is what I wrote when describing the previous algorithm:
9450
9451 @quotation
9452 The basic algorithm we use is to keep track of a known region of
9453 characters in each buffer, all of which are of the same width. We keep
9454 track of the boundaries of the region in both Charbpos and Bytebpos
9455 coordinates and also keep track of the char width, which is 1 - 4 bytes.
9456 If the position we're translating is not in the known region, then we
9457 invoke a function to update the known region to surround the position in
9458 question. This assumes locality of reference, which is usually the
9459 case.
9460
9461 Note that the function to update the known region can be simple or
9462 complicated depending on how much information we cache. In addition to
9463 the known region, we always cache the correct conversions for point,
9464 BEGV, and ZV, and in addition to this we cache 16 positions where the
9465 conversion is known. We only look in the cache or update it when we
9466 need to move the known region more than a certain amount (currently 50
9467 chars), and then we throw away a "random" value and replace it with the
9468 newly calculated value.
9469
9470 Finally, we maintain an extra flag that tracks whether the buffer is
9471 entirely ASCII, to speed up the conversions even more. This flag is
9472 actually of dubious value because in an entirely-ASCII buffer the known
9473 region will always span the entire buffer (in fact, we update the flag
9474 based on this fact), and so all we're saving is a few machine cycles.
9475
9476 A potentially smarter method than what we do with known regions and
9477 cached positions would be to keep some sort of pseudo-extent layer over
9478 the buffer; maybe keep track of the charbpos/bytebpos correspondence at
9479 the beginning of each line, which would allow us to do a binary search
9480 over the pseudo-extents to narrow things down to the correct line, at
9481 which point you could use a linear movement method. This would also
9482 mesh well with efficiently implementing a line-numbering scheme.
9483 However, you have to weigh the amount of time spent updating the cache
9484 vs. the savings that result from it. In reality, we modify the buffer
9485 far less often than we access it, so a cache of this sort that provides
9486 guaranteed LOG (N) performance (or perhaps N * LOG (N), if we set a
9487 maximum on the cache size) would indeed be a win, particularly in very
9488 large buffers. If we ever implement this, we should probably set a
9489 reasonably high minimum below which we use the old method, because the
9490 time spent updating the fancy cache would likely become dominant when
9491 making buffer modifications in smaller buffers.
9492
9493 Note also that we have to multiply or divide by the char width in order
9494 to convert the positions. We do some tricks to avoid ever actually
9495 having to do a multiply or divide, because that is typically an
9496 expensive operation (esp. divide). Multiplying or dividing by 1, 2, or
9497 4 can be implemented simply as a shift left or shift right, and we keep
9498 track of a shifter value (0, 1, or 2) indicating how much to shift.
9499 Multiplying by 3 can be implemented by doubling and then adding the
9500 original value. Dividing by 3, alas, cannot be implemented in any
9501 simple shift/subtract method, as far as I know; so we just do a table
9502 lookup. For simplicity, we use a table of size 128K, which indexes the
9503 "divide-by-3" values for the first 64K non-negative numbers. (Note that
9504 we can increase the size up to 384K, i.e. indexing the first 192K
9505 non-negative numbers, while still using shorts in the array.) This also
9506 means that the size of the known region can be at most 64K for
9507 width-three characters.
9508 @end quotation
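
The shift and table-lookup tricks from the quoted description can be
sketched as follows.  This is an illustration only, not the old XEmacs
code: the names @code{chars_to_bytes}, @code{bytes_to_chars} and
@code{div3_table} are made up for this example, and the table is filled
lazily by counting rather than by dividing.

```c
#include <assert.h>

/* 64K unsigned shorts = 128K bytes, as in the quoted description. */
static unsigned short div3_table[65536];
static int div3_ready;

/* Multiply a character count by a fixed width of 1-4 bytes using only
   shifts and adds. */
int
chars_to_bytes (int nchars, int width)
{
  assert (nchars >= 0 && width >= 1 && width <= 4);
  if (width == 3)
    return (nchars << 1) + nchars;      /* double, then add the original */
  return nchars << (width >> 1);        /* shifter: 1 -> 0, 2 -> 1, 4 -> 2 */
}

/* Divide a byte count by a fixed width of 1-4 bytes; width 3 uses the
   lookup table, which limits that case to counts below 64K. */
int
bytes_to_chars (int nbytes, int width)
{
  assert (nbytes >= 0 && width >= 1 && width <= 4);
  if (width == 3)
    {
      assert (nbytes < 65536);
      if (!div3_ready)
        {
          int i;
          unsigned short quotient = 0;
          int remainder = 0;

          for (i = 0; i < 65536; i++)
            {
              div3_table[i] = quotient;
              if (++remainder == 3)
                {
                  remainder = 0;
                  quotient++;
                }
            }
          div3_ready = 1;
        }
      return div3_table[nbytes];
    }
  return nbytes >> (width >> 1);
}
```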
9509
9510 Unfortunately, it turned out that the implementation had serious problems
9511 which had never been corrected. In particular, the known region had a
9512 large tendency to become zero-length and stay that way.
9513
9514 So I decided to port the algorithm from FSF 21.3, in @file{marker.c}.
9515
9516 This algorithm is fairly simple. Instead of using markers I kept the cache
9517 array of known positions from the previous implementation.
9518
9519 Basically, we keep a number of positions cached:
9520
9521 @itemize @bullet
9522 @item
9523 the actual end of the buffer
9524 @item
9525 the beginning and end of the accessible region
9526 @item
9527 the value of point
9528 @item
9529 the position of the gap
9530 @item
9531 the last value we computed
9532 @item
9533 a set of positions that are "far away" from previously computed positions
9534 (5000 chars currently; #### perhaps should be smaller)
9535 @end itemize
9536
9537 For each position, we @code{CONSIDER()} it. This means:
9538
9539 @itemize @bullet
9540 @item
9541 If the position is what we're looking for, return it directly.
9542 @item
9543 Starting with the beginning and end of the buffer, we successively
9544 compute the smallest enclosing range of known positions. If at any
9545 point we discover that this range has the same byte and char length
9546 (i.e. is entirely single-byte), then our computation is trivial.
9547 @item
9548 If at any point we get a small enough range (50 chars currently),
9549 stop considering further positions.
9550 @end itemize
9551
9552 Otherwise, once we have an enclosing range, see which side is closer, and
9553 iterate until we find the desired value. As an optimization, I replaced
9554 the simple loop in FSF with the use of @code{bytecount_to_charcount()},
9555 @code{charcount_to_bytecount()}, @code{bytecount_to_charcount_down()}, or
9556 @code{charcount_to_bytecount_down()}. (The latter two I added for this purpose.)
9557 These scan 4 or 8 bytes at a time through purely single-byte characters.
9558
9559 If the amount we had to scan was more than our "far away" distance (5000
9560 characters, see above), then cache the new position.
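
The overall shape of this algorithm might be sketched as below.  Several
things here are assumptions made purely for illustration: positions are
0-based, the cache is a plain array passed in by the caller, the names
@code{struct known_pos} and @code{char_to_byte} are hypothetical, and
character boundaries are detected with a UTF-8-style rule (continuation
bytes look like @code{10xxxxxx}) rather than with the actual Mule
internal encoding.  The scan is also done one character at a time
instead of 4 or 8 bytes at a time, and newly computed positions are not
cached.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_RANGE 50          /* stop considering once the range is this small */

struct known_pos
{
  int charpos;
  int bytepos;
};

/* Assumption: UTF-8-style text; continuation bytes are 10xxxxxx. */
static int
is_char_start (unsigned char c)
{
  return (c & 0xC0) != 0x80;
}

/* Convert the character position TARGET to a byte position in TEXT,
   which holds NCHARS characters in NBYTES bytes. */
int
char_to_byte (const unsigned char *text, int nbytes, int nchars,
              const struct known_pos *cache, int ncache, int target)
{
  struct known_pos lo = { 0, 0 };
  struct known_pos hi;
  int i, c, b;

  hi.charpos = nchars;
  hi.bytepos = nbytes;
  assert (target >= 0 && target <= nchars);

  /* Consider each cached position, tightening the enclosing range. */
  for (i = 0; i < ncache; i++)
    {
      if (cache[i].charpos == target)
        return cache[i].bytepos;
      if (cache[i].charpos < target && cache[i].charpos > lo.charpos)
        lo = cache[i];
      else if (cache[i].charpos > target && cache[i].charpos < hi.charpos)
        hi = cache[i];
      if (hi.charpos - lo.charpos <= SMALL_RANGE)
        break;
    }

  /* Same byte and char length: the range is entirely single-byte. */
  if (hi.bytepos - lo.bytepos == hi.charpos - lo.charpos)
    return lo.bytepos + (target - lo.charpos);

  if (target - lo.charpos <= hi.charpos - target)
    {
      /* The lower bound is closer: scan forward. */
      for (c = lo.charpos, b = lo.bytepos; c < target; c++)
        do
          b++;
        while (b < nbytes && !is_char_start (text[b]));
    }
  else
    {
      /* The upper bound is closer: scan backward. */
      for (c = hi.charpos, b = hi.bytepos; c > target; c--)
        do
          b--;
        while (!is_char_start (text[b]));
    }
  return b;
}
```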
9561
9562 #### Things to do:
9563
9564 @itemize @bullet
9565 @item
9566 Look at the most recent GNU Emacs to see whether anything has changed.
9567 @item
9568 Think about whether it makes sense to try to implement some sort of
9569 known region or list of "known regions", like we had before. This would
9570 be a region of entirely single-byte characters that we can check very
9571 quickly. (Previously I used a range of same-width characters of any
9572 size; but this adds extra complexity and slows down the scanning, and is
9573 probably not worth it.) As part of the scanning process in
9574 @code{bytecount_to_charcount()} et al, we skip over chunks of entirely
9575 single-byte chars, so it should be easy to remember the last one.
9576 Presumably what we should do is keep track of the largest known surrounding
9577 entirely-single-byte region for each of the cache positions as well as
9578 perhaps the last-cached position. We want to be careful not to get bitten
9579 by the previous problem of having the known region getting reset too
9580 often. If we implement this, we might well want to continue scanning
9581 some distance past the desired position (maybe 300-1000 bytes) if we are
9582 in a single-byte range so that we won't end up expanding the known range
9583 one position at a time and entering the function each time.
9584 @item
9585 Think about whether it makes sense to keep the position cache sorted.
9586 This would allow it to be larger and finer-grained in its positions.
9587 Note that with FSF's use of markers, they were sorted, but this
9588 was not really made good use of. With an array, we can do binary searching
9589 to quickly find the smallest range. We would probably want to make use of
9590 the gap-array code in extents.c.
9591 @end itemize
9592
9593 Note that FSF's algorithm checked @strong{ALL} markers, not just the ones cached
9594 by this algorithm. This includes markers created by the user as well as
9595 both ends of any overlays. We could do similarly, and our extents could
9596 keep both byte and character positions rather than just the former. (But
9597 this would probably be overkill. We should just use our cache instead.
9598 Any place an extent was set was surely already visited by the char<-->byte
9599 conversion routines.)
9600
9601 @node Searching and Matching, , Byte-Char Position Conversion, Text
9602 @section Searching and Matching
9603 @cindex searching
9604 @cindex matching
9605
9606 Very incomplete, limited to a brief introduction.
9607
9608 People find the searching and matching code difficult to understand.
9609 And indeed, the details are hard. However, the basic structures are not
9610 so complex. First, there's a hard question with a simple answer. What
9611 about Mule? The answer here is that it turns out that Mule characters
9612 can be matched byte by byte, so neither the search code nor the regular
9613 expression code need take much notice of it at all! Of course, we add
9614 some special features (such as regular expressions that match only
9615 certain charsets), but these do not require new concepts. The main
9616 exception is that wild-card matches in Mule have to be careful to
9617 swallow whole characters. This is handled using the same basic macros
9618 that are used for buffer and string movements.
9619
9620 This will also be true if a UTF-8 representation is used for the
9621 internal encoding.
9622
9623 The complex algorithms for searching are for simple string searches. In
9624 particular, the algorithm used for fast string searching is Boyer-Moore.
9625 This algorithm is based on the idea that if you have a mismatch at a
9626 given position, you can precompute where to restart the search. This
9627 means that you can often make far fewer than N character
9628 comparisons, where N is the position at which the match is found, or the
9629 size of the text if it contains no match. That's fast! But it's not
9630 easy. You must ``compile'' the search string into a jump table. See
9631 the source, @file{search.c}, for more information.
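
To give a feel for the ``compile into a jump table'' idea, here is a
sketch of the Horspool simplification of Boyer-Moore.  It is
@emph{not} the code in @file{search.c}, which is considerably more
elaborate; the name @code{horspool_search} is hypothetical, and only a
single ``bad character'' table is used.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the first occurrence of PAT (length M) in TEXT
   (length N), or -1 if there is none. */
long
horspool_search (const unsigned char *text, size_t n,
                 const unsigned char *pat, size_t m)
{
  size_t skip[256];
  size_t i, pos;

  assert (m > 0);
  if (m > n)
    return -1;

  /* "Compile" the pattern: for each byte value, how far may the window
     slide when that byte is aligned with the window's last position? */
  for (i = 0; i < 256; i++)
    skip[i] = m;
  for (i = 0; i + 1 < m; i++)
    skip[pat[i]] = m - 1 - i;

  for (pos = 0; pos <= n - m; pos += skip[text[pos + m - 1]])
    {
      /* Compare right to left, so that a mismatch is found early. */
      size_t j = m;

      while (j > 0 && text[pos + j - 1] == pat[j - 1])
        j--;
      if (j == 0)
        return (long) pos;
    }
  return -1;
}
```

Because the window usually slides by several bytes at a time, many bytes
of the text are never examined at all.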
9632
9633 Emacs changes the basic algorithms somewhat in order to handle
9634 case-insensitive searches without a full-blown regular expression.
9635
9636 Regular expressions, on the other hand, have a trivial search
9637 implementation: try a match at each position. (Under POSIX rules, it's
9638 a bit more complex, because POSIX requires that you find the
9639 @emph{longest} match in the text. This means you keep a record of the
9640 best match so far, and find all the matches.)
9641
9642 The matching code for regular expressions is quite complex. First, the
9643 regular expression itself is compiled. There are two basic approaches
9644 that could be taken. The first is to compile the expression into tables
9645 to drive a generic finite automaton emulator. This is the approach
9646 given in many textbooks (Sedgewick's @emph{Algorithms} and Aho, Sethi,
9647 and Ullman's @emph{Compilers: Principles, Techniques, and Tools}, aka
9648 ``The Dragon Book'') as well as being used by the @file{lex} family of
9649 lexical analysis engines.
9650
9651 Emacs uses a somewhat different technique. The expression is compiled
9652 into a form of bytecode, which is interpreted by a special interpreter.
9653 The interpreter itself basically amounts to an inline implementation of
9654 the finite automaton emulator. The advantage of this technique is that
9655 it's easier to add special features, such as control of case-sensitivity
9656 via a global variable.
9657
9658 The compiler is not treated here. See the source, @file{regex.c}. The
9659 interpreter, although it is divided into several functions, and looks
9660 fearsomely complex, is actually quite simple in concept. After all,
9661 what you're basically doing there is a @code{strcmp} on steroids, right?
9662
9663 @example
9664 int
9665 strcmp (char *p,            /* pattern pointer */
9666         char *b)            /* buffer pointer */
9667 @{
9668   while (*p && *p == *b)    /* stop at a mismatch or at the end of p */
9669     p++, b++;
9670   return *p - *b;           /* zero if and only if the strings are equal */
9671 @}
9672 @end example
9673
9674 Really, it's no harder than that. (A bit of a white lie, OK?)
9675
9676 How does the regexp code generalize this?
9677
9678 @enumerate
9679 @item
9680 Depending on the pattern, @code{*b} may have a general relationship to
9681 @code{*p}. @emph{I.e.}, direct comparison against @code{*p} is
9682 generalized to include checks for set membership, and context-dependent
9683 properties. The latter depend on the position @code{&*b}, not just on the
9684 character @code{*b}; in C that is simply @code{b}, so we use @code{b} directly.
9685
9686 @item
9687 Although to ensure the algorithm terminates, @code{b} must advance step
9688 by step, @code{p} can branch and jump.
9689
9690 @item
9691 The information returned is much greater, including information about
9692 subexpressions.
9693 @end enumerate
9694
9695 We'll ignore (3). (2) is mostly interesting when compiling the regular
9696 expression. Now we have
9697
9698 @example
9699 @group
9700 enum operator_t @{
9701   accept = 0,
9702   exact,
9703   any,
9704   range,
9705   group,   /* actually, these are probably */
9706   repeat,  /* turned into conditional code */
9707   /* etc */
9708 @};
9709 @end group
9710
9711 @group
9712 enum status_t @{
9713   working = 0,
9714   matched,
9715   mismatch,
9716   end_of_buffer,
9717   error
9718 @};
9719 @end group
9720
9721 @group
9722 struct pattern @{
9723   enum operator_t operator;
9724   char char_value;
9725   boolean range_table[256];
9726   /* etc, etc */
9727 @};
9728 @end group
9729
9730 @group
9731 struct pattern *p;          /* pattern pointer */
9732 char *b;                    /* buffer pointer */
9733
9734 enum status_t
9735 match (struct pattern *p, char *b)
9736 @{
9737   enum status_t done = working;
9738
9739   while (!(done = match_1_operator (p, b)))
9740     @{
9741       struct pattern *p1 = p;
9742       p = next_p (p, b);
9743       b = next_b (p1, b);
9744     @}
9745   return done;
9746 @}
9747 @end group
9748 @end example
9749
9750 This format exposes the underlying finite automaton.
9751
9752 All three helper functions have the following structure, except that the
9753 @samp{next_*} functions decide where to jump (for @samp{p}) and whether or
9754 not to increment (for @samp{b}), rather than checking for satisfaction of a
9755 matching condition.
9756
9757 @example
9758 enum status_t
9759 match_1_operator (struct pattern *p, char *b)
9760 @{
9761   if (p->operator != accept && ! *b) return end_of_buffer;
9762   switch (p->operator)
9763     @{
9764     case accept:
9765       return matched;
9766     case exact:
9767       if (*b != p->char_value) return mismatch; else break;
9768     case any:
9769       break;
9770     case range:
9771       /* range_table is computed in the regexp_compile function */
9772       if (! p->range_table[(unsigned char) *b]) return mismatch; else break;
9773     /* etc, etc */
9774     @}
9775   return working;
9776 @}
9777 @end example
9778
9779 Grouping, repetition, and alternation are handled by compiling the
9780 subexpression and calling @code{match (p->subpattern, b)} recursively.
9781
9782 In terms of reading the actual code, there are five optimizations
9783 (obfuscations, if you like) that have been done.
9784
9785 @enumerate
9786 @item
9787 An explicit "failure stack" has been substituted for recursion.
9788
9789 @item
9790 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions
9791 are actually inlined into the @code{match} function for efficiency.
9792 Then the pointer movement is interspersed with the matching operations.
9793
9794 @item
9795 If the operator uses buffer context, the buffer pointer movement is
9796 sometimes implicit in the operations retrieving the context.
9797
9798 @item
9799 Some cases are combined into short preparation for individual cases, and
9800 a "fall-through" into combined code for several cases.
9801
9802 @item
9803 The @code{pattern} type is not an explicit @samp{struct}. Instead, the
9804 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the
9805 compiled bytecode. This leads to bizarre code in the interpreter like
9806
9807 @example
9808 case range:
9809 p += *(p + 1); break;
9810 @end example
9811
9812 in @code{next_p}, because the compiled pattern is laid out
9813
9814 @example
9815 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ...
9816 @end example
9817 @end enumerate
9818
9819 But if you keep your eye on the "switch in a loop" structure, you
9820 should be able to understand the parts you need.
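
To make the ``switch in a loop'' structure and the inlined pattern data
concrete, here is a toy bytecode matcher.  It is a sketch under invented
assumptions and bears no relation to the real opcode set in
@file{regex.c}: the opcodes @code{OP_ACCEPT}, @code{OP_EXACT},
@code{OP_ANY} and @code{OP_RANGE} and the function @code{match_bytecode}
are all hypothetical.  A range is laid out as the opcode, a count of
flag bytes, and then that many bytes of bit flags, so skipping it means
advancing the pattern pointer by a value read from the pattern itself.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_ACCEPT, OP_EXACT, OP_ANY, OP_RANGE };

/* Match the compiled pattern P against the BLEN bytes at B.  Returns
   the number of bytes matched, or -1 on mismatch or malformed input. */
int
match_bytecode (const unsigned char *p, const unsigned char *b, size_t blen)
{
  size_t i = 0;                 /* plays the role of the buffer pointer */

  assert (p != NULL && b != NULL);
  for (;;)
    switch (*p)
      {
      case OP_ACCEPT:
        return (int) i;
      case OP_EXACT:            /* layout: OP_EXACT, char */
        if (i >= blen || b[i] != p[1])
          return -1;
        i++;
        p += 2;
        break;
      case OP_ANY:              /* layout: OP_ANY */
        if (i >= blen)
          return -1;
        i++;
        p += 1;
        break;
      case OP_RANGE:            /* layout: OP_RANGE, count, flags... */
        {
          unsigned count = p[1];
          unsigned c;

          if (i >= blen)
            return -1;
          c = b[i];
          /* Bit (c % 8) of flag byte (c / 8) says whether C is in the set. */
          if (c / 8 >= count || !(p[2 + c / 8] & (1u << (c % 8))))
            return -1;
          i++;
          p += 2 + count;       /* skip the inlined flags to the next op */
          break;
        }
      default:
        return -1;
      }
}
```

For instance, a pattern equivalent to @samp{h[a-c].} consists of
@code{OP_EXACT}, @samp{h}, @code{OP_RANGE}, 13, twelve zero bytes, the
byte @code{0x0E} (bits 1 through 3 of flag byte 12, covering character
codes 97 through 99), @code{OP_ANY}, and @code{OP_ACCEPT}.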
9821
9822 @node Multilingual Support, The Lisp Reader and Compiler, Text, Top
9823 @chapter Multilingual Support
9824 @cindex Mule character sets and encodings
9825 @cindex character sets and encodings, Mule
9826 @cindex encodings, Mule character sets and
9827
9828 @emph{NOTE}: There is a great deal of overlapping and redundant
9829 information in this chapter. Ben wrote introductions to Mule issues a
9830 number of times, each time not realizing that he had already written
9831 another introduction previously. Hopefully, in time these will all be
9832 integrated.
9833
9834 @emph{NOTE}: The information at the top of the source file
9835 @file{text.c} is more complete than the following, and there is also a
9836 list of all other places to look for text/I18N-related info. Also look in
9837 @file{text.h} for info about the DFC and Eistring APIs.
9838
9839 Recall that there are two primary ways that text is represented in
9840 XEmacs. The @dfn{buffer} representation sees the text as a series of
9841 bytes (Ibytes), with a variable number of bytes used per character.
9842 The @dfn{character} representation sees the text as a series of integers
9843 (Ichars), one per character. The character representation is a cleaner
9844 representation from a theoretical standpoint, and is thus used in many
9845 cases when lots of manipulations on a string need to be done. However,
9846 the buffer representation is the standard representation used in both
9847 Lisp strings and buffers, and because of this, it is the ``default''
9848 representation that text comes in. The reason for using this
9849 representation is that it's compact and is compatible with ASCII.
9850
9851 @menu
9852 * Introduction to Multilingual Issues #1::
9853 * Introduction to Multilingual Issues #2::
9854 * Introduction to Multilingual Issues #3::
9855 * Introduction to Multilingual Issues #4::
9856 * Character Sets::
9857 * Encodings::
9858 * Internal Mule Encodings::
9859 * Byte/Character Types; Buffer Positions; Other Typedefs::
9860 * Internal Text API's::
9861 * Coding for Mule::
9862 * CCL::
9863 * Modules for Internationalization::
9864 @end menu
9865
9866 @node Introduction to Multilingual Issues #1, Introduction to Multilingual Issues #2, Multilingual Support, Multilingual Support
9867 @section Introduction to Multilingual Issues #1
9868 @cindex introduction to multilingual issues #1
9869
9870 There is an introduction to these issues in the Lisp Reference manual.
9871 @xref{Internationalization Terminology,,, lispref, XEmacs Lisp Reference
9872 Manual}. Among other documentation that may be of interest to internals
9873 programmers is ISO-2022 (@pxref{ISO 2022,,, lispref, XEmacs Lisp
9874 Reference Manual}) and CCL (@pxref{CCL,,, lispref, XEmacs Lisp Reference
9875 Manual}).
9876
9877 @node Introduction to Multilingual Issues #2, Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #1, Multilingual Support
9878 @section Introduction to Multilingual Issues #2
9879 @cindex introduction to multilingual issues #2
9880
9881 @subheading Introduction
9882
9883 This document covers a number of design issues, problems and proposals
9884 with regard to XEmacs MULE. First we present some definitions and
9885 some aspects of the design that have been agreed upon. Then we present
9886 some issues and problems that need to be addressed, and then I include a
9887 proposal of mine to address some of these issues. When there are other
9888 proposals, for example from Olivier, these will be appended to the end
9889 of this document.
9890
9891 @subheading Definitions and Design Basics
9892
9893 First, @dfn{text} is defined to be a series of characters which together
9894 define an utterance or partial utterance in some language.
9895 Generally, this language is a human language, but it may also be a
9896 computer language if the computer language uses a representation close
9897 enough to that of human languages for it to also make sense to call its
9898 representation text. Text is opposed to @dfn{binary}, which is a sequence
9899 of bytes, representing machine-readable but not human-readable data.
9900 A @dfn{byte} is merely a number within a predefined range, which nowadays is
9901 nearly always zero to 255. A @dfn{character} is a unit of text. What makes
9902 one character different from another is not always clear-cut. It is
9903 generally related to the appearance of the character, although perhaps
9904 not any possible appearance of that character, but some sort of ideal
9905 appearance that is assigned to a character. Whether two characters
9906 that look very similar are actually the same depends on various
9907 factors, including political ones and whether the characters are
9908 used to mean similar sorts of things or behave similarly in similar
9909 contexts. In any case, it is not always clearly defined whether two
9910 characters are actually the same or not. In practice, however, this
9911 is more or less agreed upon.
9912
9913 A @dfn{character set} is just that, a set of one or more characters.
9914 The set is unique in that there will not be more than one instance of
9915 the same character in a character set, and logically is unordered,
9916 although an order is often imposed or suggested for the characters in
9917 the character set. We can also define an @dfn{order} on a character
9918 set, which is a way of assigning a unique number, or possibly a pair of
9919 numbers, or a triplet of numbers, or even a set of four or more numbers
9920 to each character in the character set. The combination of a character
9921 set and an order on it results in an @dfn{ordered character set}. In an
9922 ordered character set, there is an upper limit and a lower limit on the
9923 possible values that a character, or that any number within the set of
9924 numbers assigned to a character, can take. However, the lower limit
9925 does not have to start at zero or one, or anywhere else in particular,
9926 nor does the upper limit have to end anywhere particular, and there may
9927 be gaps within these ranges such that particular numbers or sets of
9928 numbers do not have a corresponding character, even though they are
9929 within the upper and lower limits. For example, @dfn{ASCII} defines a
9930 very standard ordered character set. It is normally defined to be 94
9931 characters in the range 33 through 126 inclusive on both ends, with
9932 every possible character within this range being actually present in the
9933 character set.
9934
9935 Sometimes the ASCII character set is extended to include what are called
9936 @dfn{non-printing characters}. Non-printing characters are characters
9937 which, rather than being displayed in a more or less rectangular
9938 block like all other characters, indicate certain functions: they
9939 typically control the display upon which the
9940 characters are being displayed, or have some effect on a communications
9941 channel that may be currently open and transmitting characters, or
9942 change the meaning of future characters as they are being decoded, or
9943 perform some other similar function. You might say that non-printing characters
9944 are somewhat of a hack because they are a special exception to the
9945 standard concept of a character as being a printed glyph that has some
9946 direct correspondence in the non-computer world.
9947
9948 With non-printing characters in mind, the 94-character ordered character
9949 set called ASCII is often extended into a 96-character ordered character
9950 set, also often called ASCII, which includes in addition to the 94
9951 characters already mentioned, two non-printing characters, one called
9952 space and assigned the number 32, just below the bottom of the previous
9953 range, and another called @dfn{delete} or @dfn{rubout}, which is given
9954 number 127 just above the end of the previous range. Thus to reiterate,
9955 the result is a 96-character ordered character set, whose characters
9956 take the values from 32 to 127 inclusive. Sometimes ASCII is further
9957 extended to contain 32 more non-printing characters, which are given the
9958 numbers zero through 31 so that the result is a 128-character ordered
9959 character set with characters numbered zero through 127, and with many
9960 non-printing characters. Another way to look at this, and the way that
9961 is normally taken by XEmacs MULE, is that the characters that would be
9962 in the range zero through 31 in the most extended definition of ASCII,
9963 instead form their own ordered character set, which is called
9964 @dfn{control zero}, and consists of 32 characters in the range zero
9965 through 31. A similar ordered character set called @dfn{control one} is
9966 also created, and it contains 32 more non-printing characters in the
9967 range 128 through 159. Note that none of these three ordered character
9968 sets overlaps in any of the numbers assigned to their
9969 characters, so they can all be used at once. Note further that the same
9970 character can occur in more than one character set. This was shown
9971 above, for example, in two different ordered character sets we defined,
9972 one of which we could have called @dfn{ASCII}, and the other
9973 @dfn{ASCII-extended}, to show that it had been extended by two non-printing
9974 characters. Most of the characters in these two character sets are
9975 shared and present in both of them.
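
The three non-overlapping ranges described above can be sketched in C as
follows. This is only an illustration: the enumeration and function names
are hypothetical and are not taken from the XEmacs sources.

```c
#include <assert.h>

/* Hypothetical names for the three ordered character sets described
   above; SET_NONE covers numbers outside all three. */
enum ordered_set { SET_CONTROL_0, SET_ASCII_96, SET_CONTROL_1, SET_NONE };

/* Return the ordered character set that owns character number N. */
static enum ordered_set
set_for_number (int n)
{
  assert (n >= 0);
  if (n <= 31)
    return SET_CONTROL_0;   /* 32 characters, zero through 31 */
  if (n <= 127)
    return SET_ASCII_96;    /* space, 94 printing characters, delete */
  if (n <= 159)
    return SET_CONTROL_1;   /* 32 characters, 128 through 159 */
  return SET_NONE;
}
```

Because the ranges never overlap, a single number is enough to tell which
of the three sets a character comes from.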
9976
9977 Note that there is no restriction on the size of the character set, or
9978 on the numbers that are assigned to characters in an ordered character
9979 set. It is often extremely useful to represent a sequence of characters
9980 as a sequence of bytes, where a byte as defined above is a number in the
9981 range zero to 255. An @dfn{encoding} does precisely this. It is simply
9982 a mapping from a sequence of characters, possibly augmented with
9983 information indicating the character set that each of these characters
9984 belongs to, to a sequence of bytes which represents that sequence of
9985 characters and no other -- which is to say the mapping is reversible.
9986
9987 A @dfn{coding system} is a set of rules for encoding a sequence of
9988 characters augmented with character set information into a sequence of
9989 bytes, and later performing the reverse operation. It is frequently
9990 possible to group coding systems into classes or types based on common
9991 features. Typically, for example, a particular coding system class
9992 may contain a base coding system which specifies some of the rules,
9993 but leaves the rest unspecified. Individual members of the coding
9994 system class are formed by starting with the base coding system, and
9995 augmenting it with additional rules to produce a particular coding
9996 system, which you might think of as a sort of variation on a
9997 theme.
9998
9999 @subheading XEmacs Specific Definitions
10000
10001 First of all, in XEmacs, the concept of character is a little different
10002 from the general definition given above. For one thing, the character
10003 set that a character belongs to may or may not be an inherent part of
10004 the character itself. In other words, the same character occurring in
10005 two different character sets may appear in XEmacs as two different
10006 characters. This is generally the case now, but we are attempting to
10007 move in the other direction. Different proposals may have different
10008 ideas about exactly the extent to which this change will be carried out.
10009 The general trend, though, is to represent all information about a
10010 character other than the character itself, using text properties
10011 attached to the character. That way two instances of the same character
10012 will look the same to lisp code that merely retrieves the character, and
10013 does not also look at the text properties of that character. Everyone
10014 involved is in agreement in doing it this way with all Latin characters,
10015 and in fact for all characters other than Chinese, Japanese, and Korean
10016 ideographs. For those, there may be a difference of opinion.
10017
10018 A second difference between the general definition of character and the
10019 XEmacs usage of character is that each character is assigned a unique
10020 number that distinguishes it from all other characters in the world, or
10021 at the very least, from all other characters currently existing anywhere
10022 inside the current XEmacs invocation. (If there is a case where the
10023 weaker statement applies, but not the stronger statement, it would
10024 possibly be with composite characters and any other such characters that
10025 are created on the sly.)
10026
10027 This unique number is called the @dfn{character representation} of the
10028 character, and its particular details are a matter of debate. There is
10029 a current standard in use, but it is undoubtedly going to change.
10030 What has definitely been agreed upon is that it will be an integer, more
10031 specifically a positive integer, represented with less than or equal to
10032 31 bits on a 32-bit architecture, and possibly up to 63 bits on a 64-bit
10033 architecture, with the proviso that any characters whose
10034 representation would fit in a 64-bit architecture, but not on a 32-bit
10035 architecture, would be used only for composite characters, and others
10036 that would satisfy the weak uniqueness property mentioned above, but not
10037 the strong uniqueness property.
10038
10039 At this point, it is useful to talk about the different representations
10040 that a sequence of characters can take. The simplest representation is
10041 simply as a sequence of characters, and this is called the @dfn{Lisp
10042 representation} of text, because it is the representation that Lisp
10043 programs see. Other representations include the external
10044 representation, which refers to any encoding of the sequence of
10045 characters, using the definition of encoding mentioned above.
10046 Typically, text in the external representation is used outside of
10047 XEmacs, for example in files, e-mail messages, web sites, and the like.
10048 Another representation for a sequence of characters is what I will call
10049 the @dfn{byte representation}, and it represents the way that XEmacs
10050 internally represents text in a buffer, or in a string. Potentially,
10051 the representation could be different between a buffer and a string, and
10052 then the terms @dfn{buffer byte representation} and @dfn{string byte
10053 representation} would be used, but in practice I don't think this will
10054 occur. It will be possible, of course, for buffers and strings, or
10055 particular buffers and particular strings, to contain different
10056 sub-representations of a single representation. For example, Olivier's
10057 1-2-4 proposal allows for three sub-representations of his internal byte
10058 representation, allowing for 1-byte, 2-byte, and 4-byte wide
10059 characters respectively. A particular string may be in one
10060 sub-representation, and a particular buffer in another
10061 sub-representation, but overall both are following the same byte
10062 representation. I do not use the term @dfn{internal representation}
10063 here, as many people have, because it is potentially ambiguous.
10064
10065 Another representation is called the @dfn{array of characters
10066 representation}. This is a representation on the C-level in which the
10067 sequence of text is represented, not using the byte representation, but
10068 by using an array of characters, each represented using the character
10069 representation. This sort of representation is often used by redisplay
10070 because it is more convenient to work with than any of the other
10071 internal representations.
10072
10073 The term @dfn{binary representation} may also be heard. Binary
10074 representation is used to represent binary data. When binary data is
10075 represented in the lisp representation, an equivalence is simply set up
10076 between bytes zero through 255, and characters zero through 255. These
10077 characters come from four character sets, which are from bottom to top,
10078 control zero, ASCII, control one, and Latin 1. Together, they comprise
10079 256 characters, and are a good mapping for the 256 possible bytes in a
10080 binary representation. Binary representation could also be used to
10081 refer to an external representation of the binary data, which is a
10082 simple direct byte-to-byte representation. No internal representation
10083 should ever be referred to as a binary representation because of
10084 ambiguity. The terms character set/encoding system were defined
10085 generally, above. In XEmacs, the equivalent concepts exist, although
10086 character set has been shortened to charset, and in fact represents
10087 specifically an ordered character set. For each possible charset, and
10088 for each possible coding system, there is an associated object in
10089 XEmacs. These objects will be of type charset and coding system,
10090 respectively. Charsets and coding systems are divided into classes, or
10091 @dfn{types}, the normal term under XEmacs, and all possible charsets and
10092 coding systems that may be defined must be of one of these types. If
10093 you need to create a charset or coding system that is not one of these
10094 types, you will have to modify the C code to support this new type.
10095 Some of the existing or soon-to-be-created types are, or will be,
10096 generic enough so that this shouldn't be an issue. Note also that the
10097 byte encoding for text and the character coding of a character are
10098 closely related. You might say that ideally each is the simplest
10099 equivalent of the other given the general constraints on each
10100 representation.
10101
10102 To be specific, in the current MULE representation,
10103
10104 @enumerate
10105 @item
10106 Characters encode both the character itself and the character set
10107 that it comes from. These character sets are always assumed to be
10108 representable as an ordered character set of size 96 or of size 96
10109 by 96, or the trivially-related sizes 94 and 94 by 94. The only
10110 allowable exceptions are the control zero and control one character
10111 sets, which are of size 32. Character sets which do not naturally
10112 have a compatible ordering such as this are shoehorned into an
10113 ordered character set, or possibly two ordered character sets of a
10114 compatible size.
10115 @item
10116 The variable width byte representation was deliberately chosen to
10117 allow scanning text forwards and backwards efficiently. This
10118 necessitated defining the possible bytes into three ranges which
10119 we shall call A, B, and C. Range A is used exclusively for
10120 single-byte characters, which is to say characters that are
10121 represented using only one byte. Multi-byte
10122 characters are always represented by using one byte from Range B,
10123 followed by one or more bytes from Range C. What this means is
10124 that bytes that begin a character are unequivocally distinguished
10125 from bytes that do not begin a character, and therefore there is
10126 never a problem scanning backwards and finding the beginning of a
10127 character. Note that UTF-8 adopts an approach that is very similar
10128 in spirit, in that it uses separate ranges for the first byte of a
10129 multi-byte sequence and for the following bytes in that
10130 sequence.
10131 @item
10132 Given the fact that all ordered character sets allowed were
10133 essentially 96 characters per dimension, it made perfect sense to
10134 make Range C comprise 96 bytes. With a little more tweaking, the
10135 currently-standard MULE byte representation was created, and was
10136 drafted from this.
10137 @item
10138 The MULE byte representation defined four basic representations for
10139 characters, which would take up from one to four bytes,
10140 respectively. The MULE character representation thus had the
10141 following constraints:
10142 @enumerate
10143 @item
10144 Character numbers zero through 255 should represent the
10145 characters that binary values zero through 255 would be
10146 mapped onto. (Note: this was not the case in Kenichi Handa's
10147 version of this representation, but I changed it.)
10148 @item
10149 The four sub-classes of representation in the MULE byte
10150 representation should correspond to four contiguous
10151 non-overlapping ranges of characters.
10152 @item
10153 The algorithmic conversion between the single character
10154 represented in the byte representation and in the character
10155 representation should be as easy as possible.
10156 @item
10157 Given the previous constraints, the character representation
10158 should be as compact as possible, which is to say it should
10159 use the least number of bits possible.
10160 @end enumerate
10161 @end enumerate
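
To illustrate why separate byte ranges make backward scanning easy, here
is a minimal C sketch. The text above does not give the exact boundaries
of Ranges A, B, and C, so the values used here (A is 0x00 through 0x7F,
B is 0x80 through 0x9F, C is the 96 bytes 0xA0 through 0xFF) are an
assumption made for the example, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed byte ranges (illustrative only):
   A: 0x00-0x7F  single-byte characters
   B: 0x80-0x9F  first byte of a multi-byte character
   C: 0xA0-0xFF  following bytes of a multi-byte character (96 values) */
static int
begins_character (unsigned char b)
{
  return b < 0xA0;   /* any byte in Range A or Range B starts a character */
}

/* Given a byte offset POS anywhere inside TEXT, return the offset of
   the first byte of the character that contains it. */
static size_t
char_start (const unsigned char *text, size_t pos)
{
  assert (text != NULL);
  while (pos > 0 && !begins_character (text[pos]))
    pos--;           /* step back over Range C bytes */
  return pos;
}
```

The loop never needs to look at more than the bytes of one character,
because a byte's range alone says whether it begins a character.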
10162
10163 So you see that the entire structure of the byte and character
10164 representations stemmed from a very small number of basic choices,
10165 which were
10166
10167 @enumerate
10168 @item
10169 the choice to encode character set information in a character
10170 @item
10171 the choice to assume that all character sets would have an order
10172 imposed upon them with 96 characters per one or two
10173 dimensions. (This is less arbitrary than it seems--it follows
10174 ISO-2022)
10175 @item
10176 the choice to use a variable width byte representation.
10177 @end enumerate
10178
10179 What this means is that you cannot really separate the byte
10180 representation, the character representation, and the assumptions made
10181 about characters and whether they encode their character sets from each
10182 other. All of these are closely intertwined, and for purposes of
10183 simplicity, they should be designed together. If you change one
10184 representation without changing another, you are in essence creating a
10185 completely new design with its own attendant problems--since your new
10186 design is likely to be quite complex and not very coherent with
10187 regard to the translation between the character and byte
10188 representations, you are likely to run into problems.
10189
10190 @node Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #4, Introduction to Multilingual Issues #2, Multilingual Support
10191 @section Introduction to Multilingual Issues #3
10192 @cindex introduction to multilingual issues #3
10193
10194 In XEmacs, Mule is a code word for the support for input, handling, and
10195 display of multi-lingual text. This section provides an overview of how
10196 this support impacts the C and Lisp code in XEmacs. It is important for
10197 anyone who works on the C or the Lisp code, especially on the C code, to
10198 be aware of these issues, even if they don't work directly on code that
10199 implements multi-lingual features, because there are various general
10200 procedures that need to be followed in order to write Mule-compliant
10201 code. (The specifics of these procedures are documented elsewhere in
10202 this manual.)
10203
10204 There are four primary aspects of Mule support:
10205
10206 @enumerate
10207 @item
10208 Internal handling and representation of multi-lingual text.
10209 @item
10210 Conversion between the internal representation of text and the various
10211 external representations in which multi-lingual text is encoded, such as
10212 Unicode representations (including mostly fixed width encodings such as
10213 UCS-2/UTF-16 and UCS-4 and variable width ASCII conformant encodings,
10214 such as UTF-7 and UTF-8); the various ISO2022 representations, which
10215 typically use escape sequences to switch between different character
10216 sets (such as Compound Text, used under X Windows; JIS, used
10217 specifically for encoding Japanese; and EUC, a non-modal encoding used
10218 for Japanese, Korean, and certain other languages); Microsoft's
10219 multi-byte encodings (such as Shift-JIS); various simple encodings for
10220 particular 8-bit character sets (such as Latin-1 and Latin-2, and
10221 encodings (such as koi8 and Alternativny) for Cyrillic); and others.
10222 This conversion needs to happen both for text in files and text sent to
10223 or retrieved from system API calls. It even needs to happen for
10224 external binary data because the internal representation does not
10225 represent binary data simply as a sequence of bytes as it is represented
10226 externally.
10227 @item
10228 Proper display of multi-lingual characters.
10229 @item
10230 Input of multi-lingual text using the keyboard.
10231 @end enumerate
10232
10233 These four aspects are for the most part independent of each other.
10234
10235 @subheading Characters, Character Sets, and Encodings
10236
10237 A @dfn{character} (which is, BTW, a surprisingly complex concept) is, in
10238 a written representation of text, the most basic written unit that has a
10239 meaning of its own. It's comparable to a phoneme when analyzing words
10240 in spoken speech (for example, the sound of @samp{t} in English, which
10241 in fact has different pronunciations in different words -- aspirated in
10242 @samp{time}, unaspirated in @samp{stop}, unreleased or even pronounced
10243 as a glottal stop in @samp{button}, etc. -- but logically is a single
10244 concept). Like a phoneme, a character is an abstract concept defined by
10245 its @emph{meaning}. The character @samp{lowercase f}, for example, can
10246 always be used to represent the first letter in the word @samp{fill},
10247 regardless of whether it's drawn upright or italic, whether the
10248 @samp{fi} combination is drawn as a single ligature, whether there are
10249 serifs on the bottom of the vertical stroke, etc. (These different
10250 appearances of a single character are often called @dfn{graphs} or
10251 @dfn{glyphs}.) Our concern when representing text is on representing the
10252 abstract characters, and not on their exact appearance.
10253
10254 A @dfn{character set} (or @dfn{charset}), as we define it, is a set of
10255 characters, each with an associated number (or set of numbers -- see
10256 below), called a @dfn{code point}. It's important to understand that a
10257 character is not defined by any number attached to it, but by its
10258 meaning. For example, ASCII and EBCDIC are two charsets containing
10259 exactly the same characters (lowercase and uppercase letters, numbers 0
10260 through 9, particular punctuation marks) but with different
10261 numberings. The `comma' character in ASCII and EBCDIC, for instance, is
10262 the same character despite having a different numbering. Conversely,
10263 when comparing ASCII and JIS-Roman, which look the same except that the
10264 latter has a yen sign substituted for the backslash, we would say that
10265 the backslash and yen sign are @strong{not} the same characters, despite having
10266 the same number (92) and despite the fact that all other characters are
10267 present in both charsets, with the same numbering. ASCII and JIS-Roman,
10268 then, do @emph{not} have exactly the same characters in them (ASCII has
10269 a backslash character but no yen-sign character, and vice-versa for
10270 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII
10271 and JIS-Roman are closer.
10272
10273 It's also important to distinguish between charsets and encodings. For
10274 a simple charset like ASCII, there is only one encoding normally used --
10275 each character is represented by a single byte, with the same value as
10276 its code point. For more complicated charsets, however, things are not
10277 so obvious. Unicode version 2, for example, is a large charset with
10278 thousands of characters, each indexed by a 16-bit number, often
10279 represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One
10280 obvious encoding uses two bytes per character (actually two encodings,
10281 depending on which of the two possible byte orderings is chosen). This
10282 encoding is convenient for internal processing of Unicode text; however,
10283 it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is
10284 usually used for external text, for example files or e-mail. UTF-8
10285 represents Unicode characters with one to three bytes (often extended to
10286 six bytes to handle characters with up to 31-bit indices). Unicode
10287 characters 00 to 7F (identical with ASCII) are directly represented with
10288 one byte, and other characters with two or more bytes, each in the range
10289 80 to FF.
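
As a concrete illustration of the one-to-three-byte scheme just
described, here is a small C sketch that encodes a single 16-bit code
point as UTF-8. The function name is hypothetical, and the sketch
deliberately ignores surrogates and code points above 0xFFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one 16-bit code point CP as UTF-8 into OUT, which must have
   room for three bytes.  Returns the number of bytes written. */
static size_t
utf8_encode16 (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFFFF);
  if (cp <= 0x7F)
    {
      out[0] = (unsigned char) cp;                 /* same as ASCII */
      return 1;
    }
  if (cp <= 0x7FF)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
  return 3;
}
```

For the Hebrew letter aleph mentioned above (0x05D0), this produces the
two bytes 0xD7 0x90, both of which lie in the range 80 to FF.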
10290
10291 In general, a single encoding may be able to represent more than one
10292 charset.
10293
10294 @subheading Internal Representation of Text
10295
10296 In an ASCII or single-European-character-set world, life is very simple.
10297 There are 256 characters, and each character is represented using the
10298 numbers 0 through 255, which fit into a single byte. With a few
10299 exceptions (such as case-changing operations or syntax classes like
10300 'whitespace'), "text" is simply an array of indices into a font. You
10301 can get different languages simply by choosing fonts with different
10302 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and
10303 everything will "just work" as long as anyone else receiving your text
10304 uses a compatible font.
10305
10306 In the multi-lingual world, however, it is much more complicated. There
10307 are a great number of different characters which are organized in a
10308 complex fashion into various character sets. The representation to use
10309 is not obvious because there are issues of size versus speed to
10310 consider. In fact, there are in general two kinds of representations to
10311 work with: one that represents a single character using an integer
10312 (possibly a byte), and the other representing a single character as a
10313 sequence of bytes. The former representation is normally called fixed
10314 width, and the other variable width. Both representations represent
10315 exactly the same characters, and the conversion from one representation
10316 to the other is governed by a specific formula (rather than by table
10317 lookup) but it may not be simple. Most C code need not, and in fact
10318 should not, know the specifics of exactly how the representations work.
10319 In fact, the code must not make assumptions about the representations.
10320 This means in particular that it must use the proper macros for
10321 retrieving the character at a particular memory location, determining
10322 how many characters are present in a particular stretch of text, and
10323 incrementing a pointer to a particular character to point to the
10324 following character, and so on. It must not assume that one character
10325 is stored using one byte, or even using any particular number of bytes.
10326 It must not assume that the number of characters in a stretch of text
10327 bears any particular relation to a number of bytes in that stretch. It
10328 must not assume that the character at a particular memory location can
10329 be retrieved simply by dereferencing the memory location, even if a
10330 character is known to be ASCII or is being compared with an ASCII
10331 character, etc. Careful coding is required to be Mule clean. The
10332 biggest work of adding Mule support, in fact, is converting all of the
10333 existing code to be Mule clean.
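
The following C sketch shows the style of code this implies: text is
walked one character at a time through a helper, and the character count
is computed rather than assumed to equal the byte count. The helper
names are hypothetical stand-ins for the real macros, and the byte test
assumes, purely for illustration, that bytes 0xA0 through 0xFF never
begin a character.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: bytes 0xA0-0xFF only ever continue a
   multi-byte character; every other byte begins a character. */
static int
is_trailing_byte (unsigned char b)
{
  return b >= 0xA0;
}

/* Advance P past one whole character, never moving beyond END. */
static const unsigned char *
next_char (const unsigned char *p, const unsigned char *end)
{
  assert (p < end);
  p++;
  while (p < end && is_trailing_byte (*p))
    p++;
  return p;
}

/* Count the characters (not the bytes) between START and END. */
static size_t
char_count (const unsigned char *start, const unsigned char *end)
{
  size_t n = 0;
  while (start < end)
    {
      start = next_char (start, end);
      n++;
    }
  return n;
}
```

Code written against helpers like these keeps working if the underlying
representation changes, which is the point of being Mule clean.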
10334
10335 Lisp code is mostly unaffected by these concerns. Text in strings and
10336 buffers appears simply as a sequence of characters regardless of
10337 whether Mule support is present. The biggest difference with older
10338 versions of Emacs, as well as current versions of GNU Emacs, is that
10339 integers and characters are no longer equivalent, but are separate
10340 Lisp Object types.
10341
10342 @subheading Conversion Between Internal and External Representations
10343
10344 All text needs to be converted to an external representation before being
10345 sent to a function or file, and all text retrieved from a function or
10346 file needs to be converted to the internal representation. This
10347 conversion needs to happen as close to the source or destination of the
10348 text as possible. No operations should ever be performed on text encoded
10349 in an external representation other than simple copying, because no
10350 assumptions can reliably be made about the format of this text. You
10351 cannot assume, for example, that the end of text is terminated by a null
10352 byte. (For example, if the text is Unicode, it will have many null bytes
10353 in it.) You cannot find the next "slash" character by searching through
10354 the bytes until you find a byte that looks like a "slash" character,
10355 because it might actually be the second byte of a Kanji character.
10356 Furthermore, all text in the internal representation must be converted,
10357 even if it is known to be completely ASCII, because the external
10358 representation may not be ASCII compatible (for example, if it is
10359 Unicode).
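
Both pitfalls are easy to reproduce. The C sketch below uses two
hand-built byte strings: the text @samp{A/} in little-endian UTF-16, and
a Shift-JIS string whose first character is a Kanji encoded as the bytes
0x95 0x5C. The second case uses the backslash byte 0x5C rather than a
slash, as an analogous example of a trailing byte that looks like an
ASCII character. The function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* "A/" in little-endian UTF-16, followed by a 16-bit terminator. */
static const char utf16le_text[] = { 'A', 0, '/', 0, 0, 0 };

/* Shift-JIS: one Kanji (bytes 0x95 0x5C) followed by the letter a. */
static const char sjis_text[] = "\x95\x5C" "a";

/* Naive length: stops at the first null byte it sees. */
static size_t
naive_length (const char *s)
{
  assert (s != NULL);
  return strlen (s);
}

/* Naive search: finds the first byte equal to BYTE, whatever it means. */
static const char *
naive_find (const char *s, int byte)
{
  assert (s != NULL);
  return strchr (s, byte);
}
```

Here @code{naive_length} reports 1 for the UTF-16 text even though it
holds four bytes of data, and @code{naive_find} searching for a
backslash lands on the second byte of the Kanji.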
10360
10361 The place where C code needs to be the most careful is when calling
10362 external API functions. It is easy to forget that all text passed to or
10363 retrieved from these functions needs to be converted. This includes text
10364 in structures passed to or retrieved from these functions and all text
10365 that is passed to a callback function that is called by the system.
10366
10367 Macros are provided to perform conversions to or from external text.
10368 These macros are called @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}
10369 respectively. These macros accept input in various forms, for example,
10370 Lisp strings, buffers, lstreams, raw data, and can return data in
10371 multiple formats, including both @code{malloc()}ed and @code{alloca()}ed data. The use
10372 of @code{alloca()}ed data here is particularly important because, in general,
10373 the returned data will not be used after making the API call, and as a
10374 result, using @code{alloca()}ed data provides a very cheap and easy to use
10375 method of allocation.
10376
10377 These macros take a coding system argument which indicates the nature of
10378 the external encoding. A coding system is an object that encapsulates
10379 the structures of a particular external encoding and the methods required
10380 to convert to and from this encoding. A facility exists to create coding
10381 system aliases, which in essence gives a single coding system two
10382 different names. It is effectively used in XEmacs to provide a layer of
10383 abstraction on top of the actual coding systems. For example, the coding
10384 system alias "file-name" points to whichever coding system is currently
10385 used for encoding and decoding file names as passed to or retrieved from
10386 system calls. In general, the actual encoding will differ from system to
10387 system, and will also depend on the locale that the user is in. The use
10388 of the file-name alias effectively hides that implementation detail
10389 behind an abstract interface layer, which provides a unified set of
10390 coding systems that are consistent across all operating environments.
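
The idea of an alias layer can be sketched in a few lines of C. This is
not how XEmacs implements aliases; the table, its contents, and the
function name are all invented for illustration, and the targets chosen
for each alias are arbitrary.

```c
#include <assert.h>
#include <string.h>

struct alias { const char *name; const char *target; };

/* Hypothetical alias table.  On another system or in another locale
   the targets would differ, but callers would not change. */
static const struct alias aliases[] = {
  { "file-name", "native" },
  { "terminal",  "native" },
  { "native",    "iso-8859-1" },
};

/* Follow aliases from NAME until an actual coding system is reached.
   Returns NULL if the chain is suspiciously long (likely a cycle). */
static const char *
resolve_coding_system (const char *name)
{
  size_t i, hops;
  assert (name != NULL);
  for (hops = 0; hops < 8; hops++)
    {
      int found = 0;
      for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
        if (strcmp (aliases[i].name, name) == 0)
          {
            name = aliases[i].target;
            found = 1;
            break;
          }
      if (!found)
        return name;   /* not an alias: this is the actual coding system */
    }
  return NULL;
}
```

Code that asks for @samp{file-name} gets whatever the table currently
maps it to, so remapping an alias changes behavior without touching any
caller.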
10391
10392 The choice of which coding system to use in a particular conversion macro
10393 requires some thought. In general, you should choose a lower-level
10394 actual coding system when the very design of the APIs you are working
10395 with calls for that particular coding system. In all other cases, you
10396 should find the least general abstract coding system (i.e. coding system
10397 alias) that applies to your specific situation. Only use the most
10398 general coding systems, such as native, when there is simply nothing else
10399 that is more appropriate. By doing things this way, you allow the user
10400 more control over how the encoding actually works, because the user is
10401 free to map the abstracted coding system names onto different actual
10402 coding systems.
10403
10404 Some common coding systems are:
10405
10406 @table @code
10407 @item ctext
10408 Compound Text, which is the standard encoding under X Windows, which is
10409 used for clipboard data and possibly other data. (ctext is a coding
10410 system of type ISO2022.)
10411
10412 @item mswindows-unicode
10413 this is used for representing text passed to MS Windows API calls with
10414 arguments that need to be in Unicode format. (mswindows-unicode is a
10415 coding system of type UTF-16)
10416
10417 @item mswindows-multi-byte
10418 this is used for representing text passed to MS Windows API calls with
10419 arguments that need to be in multi-byte format. Note that there are
10420 very few if any examples of such calls.
10421
10422 @item mswindows-tstr
10423 this is used for representing text passed to any MS Windows API calls
10424 that declare their argument as LPTSTR, or LPCTSTR. This is the vast
10425 majority of system calls and automatically translates either to
10426 mswindows-unicode or mswindows-multi-byte, depending on the presence or
10427 absence of the UNICODE preprocessor constant. (If we compile XEmacs
10428 with this preprocessor constant, then all API calls use Unicode for all
10429 text passed to or received from these API calls.)
10430
10431 @item terminal
10432 used for text sent to or read from a text terminal in the absence of a
10433 more specific coding system (calls to window-system specific APIs should
10434 use the appropriate window-specific coding system if it makes sense to
10435 do so.)
10436
10437 @item file-name
10438 used when specifying the names of files in the absence of a more
10439 specific encoding, such as mswindows-tstr.
10440
10441 @item native
10442 the most general coding system for specifying text passed to system
10443 calls. This generally translates to whatever coding system is specified
10444 by the current locale. This should only be used when none of the coding
10445 systems mentioned above are appropriate.
10446 @end table
10447
10448 @subheading Proper Display of Multilingual Text
10449
10450 There are two things required to get this working correctly. One is
10451 selecting the correct font, and the other is encoding the text according
10452 to the encoding used for that specific font, or the window-system
10453 specific text display API. Generally each separate character set has a
10454 different font associated with it, which is specified by name and each
10455 font has an associated encoding into which the characters must be
10456 translated. (This is the case on X Windows, at least; on Windows there
10457 is a more general mechanism.) Both the specific font for a charset and
10458 the encoding of that font are system dependent. Currently there is a
10459 way of specifying these two properties under X Windows (using the
10460 registry and ccl properties of a character set) but not for other window
10461 systems. A more general system needs to be implemented to allow these
10462 characteristics to be specified for all window systems.
10463
10464 Another issue is making sure that the necessary fonts for displaying
10465 various character sets are installed on the system. Currently, XEmacs
10466 provides, on its web site, X Windows fonts for a number of different
10467 character sets that can be installed by users. This isn't done yet for
10468 Windows, but it should be.
10469
10470 @subheading Inputting of Multilingual Text
10471
10472 This is a rather complicated issue because there are many paradigms
10473 defined for inputting multi-lingual text, some of which are specific to
10474 particular languages, and any particular language may have many
10475 different paradigms defined for inputting its text. These paradigms are
10476 encoded in input methods and there is a standard API for defining an
10477 input method in XEmacs called LEIM, or Library of Emacs Input Methods.
10478 Some of these input methods are written entirely in Elisp, and thus are
10479 system-independent, while others require the aid either of an external
10480 process, or of C level support that ties into a particular
10481 system-specific input method API, for example, XIM under X Windows, or
10482 the active keyboard layout and IME support under Windows. Currently,
10483 there is no support for any system-specific input methods under
10484 Microsoft Windows, although this will change.
10485
10486 @node Introduction to Multilingual Issues #4, Character Sets, Introduction to Multilingual Issues #3, Multilingual Support
10487 @section Introduction to Multilingual Issues #4
10488 @cindex introduction to multilingual issues #4
10489
10490 The rest of the sections in this chapter consist of yet another
10491 introduction to multilingual issues, duplicating the information in the
10492 previous sections.
10493
10494 @node Character Sets, Encodings, Introduction to Multilingual Issues #4, Multilingual Support
10495 @section Character Sets
10496 @cindex character sets
10497
10498 A @dfn{character set} (or @dfn{charset}) is an ordered set of
10499 characters. A particular character in a charset is indexed using one or
10500 more @dfn{position codes}, which are non-negative integers. The number
10501 of position codes needed to identify a particular character in a charset
10502 is called the @dfn{dimension} of the charset. In XEmacs/Mule, all
10503 charsets have dimension 1 or 2, and the size of all charsets (except for
10504 a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range
10505 of position codes used to index characters from any of these types of
10506 character sets is as follows:
10507
10508 @example
10509 Charset type            Position code 1         Position code 2
10510 ------------------------------------------------------------
10511 94                      33 - 126                N/A
10512 96                      32 - 127                N/A
10513 94x94                   33 - 126                33 - 126
10514 96x96                   32 - 127                32 - 127
10515 @end example
10516
10517 Note that in the above cases position codes do not start at an
10518 expected value such as 0 or 1. The reason for this will become clear
10519 later.
10520
10521 For example, Latin-1 is a 96-character charset, and JISX0208 (the
10522 Japanese national character set) is a 94x94-character charset.
10523
10524 [Note that, although the ranges above define the @emph{valid} position
10525 codes for a charset, some of the slots in a particular charset may in
10526 fact be empty. This is the case for JISX0208, for example, where
10527 all the slots whose first position code is in the range 118 - 126 are
10528 empty.]
10529
10530 There are three charsets that do not follow the above rules. All of
10531 them have one dimension, and have ranges of position codes as follows:
10532
10533 @example
10534 Charset name            Position code 1
10535 ------------------------------------
10536 ASCII                   0 - 127
10537 Control-1               0 - 31
10538 Composite               0 - some large number
10539 @end example
10540
10541 (The upper bound of the position code for composite characters has not
10542 yet been determined, but it will probably be at least 16,383).
10543
10544 ASCII is the union of two subsidiary character sets: Printing-ASCII
10545 (the printing ASCII character set, consisting of position codes 33 -
10546 126, like for a standard 94-character charset) and Control-ASCII (the
10547 non-printing characters that would appear in a binary file with codes 0
10548 - 32 and 127).
10549
10550 Control-1 contains the non-printing characters that would appear in a
10551 binary file with codes 128 - 159.
10552
10553 Composite contains characters that are generated by overstriking one
10554 or more characters from other charsets.
10555
10556 Note that some characters in ASCII, and all characters in Control-1,
10557 are @dfn{control} (non-printing) characters. These have no printed
10558 representation but instead control some other function of the printing
10559 (e.g. TAB or 9 moves the current character position to the next tab
10560 stop). All other characters in all charsets are @dfn{graphic}
10561 (printing) characters.
10562
10563 When a binary file is read in, the bytes in the file are assigned to
10564 character sets as follows:
10565
10566 @example
10567 Bytes           Character set           Range
10568 --------------------------------------------------
10569 0 - 127         ASCII                   0 - 127
10570 128 - 159       Control-1               0 - 31
10571 160 - 255       Latin-1                 32 - 127
10572 @end example
10573
10574 This is a bit ad-hoc but gets the job done.
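
The byte-to-charset assignment in the table above can be sketched in C as
follows. This is only an illustration: the enum, the struct and the function
name are made up for this example and do not appear in the XEmacs sources.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum binary_charset { CS_ASCII, CS_CONTROL_1, CS_LATIN_1 };

struct charset_pos
{
  enum binary_charset charset;
  int position_code;            /* position code within that charset */
};

/* Map one byte of a binary file to a (charset, position code) pair,
   following the three rows of the table. */
struct charset_pos
classify_binary_byte (unsigned char byte)
{
  struct charset_pos result;

  if (byte <= 127)
    {
      result.charset = CS_ASCII;        /* bytes 0 - 127 -> 0 - 127 */
      result.position_code = byte;
    }
  else if (byte <= 159)
    {
      result.charset = CS_CONTROL_1;    /* bytes 128 - 159 -> 0 - 31 */
      result.position_code = byte - 128;
    }
  else
    {
      result.charset = CS_LATIN_1;      /* bytes 160 - 255 -> 32 - 127 */
      result.position_code = byte - 128;
    }
  assert (result.position_code >= 0 && result.position_code <= 127);
  return result;
}
```

Note that Control-1 and Latin-1 both subtract 128; only the charset label
differs, which is what makes the scheme "a bit ad-hoc".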
10575
10576 @node Encodings, Internal Mule Encodings, Character Sets, Multilingual Support
10577 @section Encodings
10578 @cindex encodings, Mule
10579 @cindex Mule encodings
10580
10581 An @dfn{encoding} is a way of numerically representing characters from
10582 one or more character sets. If an encoding only encompasses one
10583 character set, then the position codes for the characters in that
10584 character set could be used directly. This is not possible, however, if
10585 more than one character set is to be used in the encoding.
10586
10587 For example, the conversion detailed above between bytes in a binary
10588 file and characters is effectively an encoding that encompasses the
10589 three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit
10590 bytes.
10591
10592 Thus, an encoding can be viewed as a way of encoding characters from a
10593 specified group of character sets using a stream of bytes, each of which
10594 contains a fixed number of bits (but not necessarily 8, as in the common
10595 usage of ``byte'').
10596
10597 Here are descriptions of a couple of common
10598 encodings:
10599
10600 @menu
10601 * Japanese EUC (Extended Unix Code)::
10602 * JIS7::
10603 @end menu
10604
10605 @node Japanese EUC (Extended Unix Code), JIS7, Encodings, Encodings
10606 @subsection Japanese EUC (Extended Unix Code)
10607 @cindex Japanese EUC (Extended Unix Code)
10608 @cindex EUC (Extended Unix Code), Japanese
10609 @cindex Extended Unix Code, Japanese EUC
10610
10611 This encompasses the character sets Printing-ASCII, Katakana-JISX0201
10612 (half-width katakana, the right half of JISX0201), Japanese-JISX0208,
10613 and Japanese-JISX0212.
10614
10615 Note that Printing-ASCII and Katakana-JISX0201 are 94-character
10616 charsets, while Japanese-JISX0208 and Japanese-JISX0212 are
10617 94x94-character charsets.
10618
10619 The encoding is as follows:
10620
10621 @example
10622 Character set            Representation (PC=position-code)
10623 -------------            --------------
10624 Printing-ASCII           PC1
10625 Katakana-JISX0201        0x8E       | PC1 + 0x80
10626 Japanese-JISX0208        PC1 + 0x80 | PC2 + 0x80
10627 Japanese-JISX0212        0x8F       | PC1 + 0x80 | PC2 + 0x80
10628 @end example
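
As a rough sketch of the representation above, the following C function
encodes one character into Japanese EUC. The enum and function names are
assumptions made for this example, not XEmacs identifiers. It follows standard
EUC-JP in giving Japanese-JISX0212 a leading 0x8F byte (single shift 3), which
is what lets a decoder tell it apart from Japanese-JISX0208.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical charset tags, for illustration only. */
enum euc_charset
{
  EUC_PRINTING_ASCII,
  EUC_KATAKANA_JISX0201,
  EUC_JAPANESE_JISX0208,
  EUC_JAPANESE_JISX0212
};

/* Encode one character into OUT (room for at least 3 bytes) and return
   the number of bytes written.  PC2 is ignored for the 94-character
   charsets. */
size_t
euc_jp_encode (enum euc_charset cs, int pc1, int pc2, unsigned char *out)
{
  size_t n = 0;

  assert (pc1 >= 33 && pc1 <= 126);
  switch (cs)
    {
    case EUC_PRINTING_ASCII:
      out[n++] = (unsigned char) pc1;
      break;
    case EUC_KATAKANA_JISX0201:
      out[n++] = 0x8E;                          /* single shift 2 */
      out[n++] = (unsigned char) (pc1 + 0x80);
      break;
    case EUC_JAPANESE_JISX0212:
      out[n++] = 0x8F;                          /* single shift 3 */
      /* fall through */
    case EUC_JAPANESE_JISX0208:
      assert (pc2 >= 33 && pc2 <= 126);
      out[n++] = (unsigned char) (pc1 + 0x80);
      out[n++] = (unsigned char) (pc2 + 0x80);
      break;
    }
  return n;
}
```

Because every non-ASCII byte has its high bit set, ASCII text passes through
unchanged, which is the ASCII-compatibility property of EUC.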
10629
10630 Note that there are other versions of EUC for other Asian languages.
10631 EUC in general is characterized by
10632
10633 @enumerate
10634 @item
10635 row-column encoding,
10636 @item
10637 big-endian (row-first) ordering, and
10638 @item
10639 ASCII compatibility in variable-width forms.
10640 @end enumerate
10641
10642 @node JIS7, , Japanese EUC (Extended Unix Code), Encodings
10643 @subsection JIS7
10644 @cindex JIS7
10645
10646 This encompasses the character sets Printing-ASCII,
10647 Latin-JISX0201 (the left half of JISX0201; this character set
10648 is very similar to Printing-ASCII and is a 94-character charset),
10649 Japanese-JISX0208, and Katakana-JISX0201. It uses 7-bit bytes.
10650
10651 Unlike EUC, this is a @dfn{modal} encoding, which means that there are
10652 multiple states that the encoding can be in, which affect how the bytes
10653 are to be interpreted. Special sequences of bytes (called @dfn{escape
10654 sequences}) are used to change states.
10655
10656 The encoding is as follows:
10657
10658 @example
10659 Character set            Representation (PC=position-code)
10660 -------------            --------------
10661 Printing-ASCII           PC1
10662 Latin-JISX0201           PC1
10663 Katakana-JISX0201        PC1
10664 Japanese-JISX0208        PC1 | PC2
10665
10666
10667 Escape sequence   ASCII equivalent   Meaning
10668 ---------------   ----------------   -------
10669 0x1B 0x28 0x4A    ESC ( J            invoke Latin-JISX0201
10670 0x1B 0x28 0x49    ESC ( I            invoke Katakana-JISX0201
10671 0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208
10672 0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII
10673 @end example
10674
10675 Initially, Printing-ASCII is invoked.
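
To make the modal behaviour concrete, here is a minimal decoder sketch in C
built from the two tables above. All names are hypothetical. It is
deliberately simplified: every byte other than ESC is treated as a position
code in the current state, so control characters such as newline inside
two-byte mode are not handled specially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum jis7_charset
{
  JIS7_PRINTING_ASCII,
  JIS7_LATIN_JISX0201,
  JIS7_KATAKANA_JISX0201,
  JIS7_JAPANESE_JISX0208
};

struct jis7_char
{
  enum jis7_charset charset;
  int pc1;
  int pc2;                      /* 0 for one-dimensional charsets */
};

/* Decode LEN bytes of JIS7 from SRC into OUT (capacity MAX_OUT).
   Returns the number of characters decoded, or -1 on malformed input
   or when OUT is too small. */
long
jis7_decode (const unsigned char *src, size_t len,
             struct jis7_char *out, size_t max_out)
{
  enum jis7_charset state = JIS7_PRINTING_ASCII;  /* initial state */
  size_t i = 0, n = 0;

  assert (src != NULL && out != NULL);
  while (i < len)
    {
      if (src[i] == 0x1B)
        {
          /* Every escape sequence in the table is three bytes long. */
          if (i + 2 >= len)
            return -1;
          if (src[i + 1] == 0x28 && src[i + 2] == 0x4A)
            state = JIS7_LATIN_JISX0201;
          else if (src[i + 1] == 0x28 && src[i + 2] == 0x49)
            state = JIS7_KATAKANA_JISX0201;
          else if (src[i + 1] == 0x24 && src[i + 2] == 0x42)
            state = JIS7_JAPANESE_JISX0208;
          else if (src[i + 1] == 0x28 && src[i + 2] == 0x42)
            state = JIS7_PRINTING_ASCII;
          else
            return -1;          /* escape sequence not in the table */
          i += 3;
          continue;
        }
      if (n == max_out)
        return -1;
      out[n].charset = state;
      out[n].pc1 = src[i++];
      out[n].pc2 = 0;
      if (state == JIS7_JAPANESE_JISX0208)
        {
          if (i == len)
            return -1;          /* truncated two-byte character */
          out[n].pc2 = src[i++];
        }
      n++;
    }
  return (long) n;
}
```

The same byte value (say 0x30) decodes differently depending on the state
variable, which is exactly what distinguishes a modal encoding from EUC.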
10676
10677 @node Internal Mule Encodings, Byte/Character Types; Buffer Positions; Other Typedefs, Encodings, Multilingual Support
10678 @section Internal Mule Encodings
10679 @cindex internal Mule encodings
10680 @cindex Mule encodings, internal
10681 @cindex encodings, internal Mule
10682
10683 In XEmacs/Mule, each character set is assigned a unique number, called a
10684 @dfn{leading byte}. This is used in the encodings of a character.
10685 Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has
10686 a leading byte of 0), although some leading bytes are reserved.
10687
10688 Charsets whose leading byte is in the range 0x80 - 0x9F are called
10689 @dfn{official} and are used for built-in charsets. Other charsets are
10690 called @dfn{private} and have leading bytes in the range 0xA0 - 0xFF;
10691 these are user-defined charsets.
10692
10693 More specifically:
10694
10695 @example
10696 Character set              Leading byte
10697 -------------              ------------
10698 ASCII                      0 (0x7F in arrays indexed by leading byte)
10699 Composite                  0x8D
10700 Dimension-1 Official       0x80 - 0x8C/0x8D
10701                              (0x8E is free)
10702 Control                    0x8F
10703 Dimension-2 Official       0x90 - 0x99
10704                              (0x9A - 0x9D are free)
10705 Dimension-1 Private Marker 0x9E
10706 Dimension-2 Private Marker 0x9F
10707 Dimension-1 Private        0xA0 - 0xEF
10708 Dimension-2 Private        0xF0 - 0xFF
10709 @end example
10710
10711 There are two internal encodings for characters in XEmacs/Mule. One is
10712 called @dfn{string encoding} and is an 8-bit encoding that is used for
10713 representing characters in a buffer or string. It uses 1 to 4 bytes per
10714 character. The other is called @dfn{character encoding} and is a 19-bit
10715 encoding that is used for representing characters individually in a
10716 variable.
10717
10718 (In the following descriptions, we'll ignore composite characters for
10719 the moment. We also give a general (structural) overview first,
10720 followed later by the exact details.)
10721
10722 @menu
10723 * Internal String Encoding::
10724 * Internal Character Encoding::
10725 @end menu
10726
10727 @node Internal String Encoding, Internal Character Encoding, Internal Mule Encodings, Internal Mule Encodings
10728 @subsection Internal String Encoding
10729 @cindex internal string encoding
10730 @cindex string encoding, internal
10731 @cindex encoding, internal string
10732
10733 ASCII characters are encoded using their position code directly. Other
10734 characters are encoded using their leading byte followed by their
10735 position code(s) with the high bit set. Characters in private character
10736 sets have their leading byte prefixed with a @dfn{leading byte prefix},
10737 which is either 0x9E or 0x9F. (No character sets are ever assigned these
10738 leading bytes.) Specifically:
10739
10740 @example
10741 Character set            Encoding (PC=position-code, LB=leading-byte)
10742 -------------            --------
10743 ASCII                    PC1  |
10744 Control-1                LB   | PC1 + 0xA0 |
10745 Dimension-1 official     LB   | PC1 + 0x80 |
10746 Dimension-1 private      0x9E | LB         | PC1 + 0x80 |
10747 Dimension-2 official     LB   | PC1 + 0x80 | PC2 + 0x80 |
10748 Dimension-2 private      0x9F | LB         | PC1 + 0x80 | PC2 + 0x80
10749 @end example
10750
10751 The basic characteristic of this encoding is that the first byte
10752 of every character is in the range 0x00 - 0x9F, and the second and
10753 following bytes of every character are in the range 0xA0 - 0xFF.
10754 This means that it is impossible to get out of sync, or more
10755 specifically:
10756
10757 @enumerate
10758 @item
10759 Given any byte position, the beginning of the character it is
10760 within can be determined in constant time.
10761 @item
10762 Given any byte position at the beginning of a character, the
10763 beginning of the next character can be determined in constant
10764 time.
10765 @item
10766 Given any byte position at the beginning of a character, the
10767 beginning of the previous character can be determined in constant
10768 time.
10769 @item
10770 Textual searches can simply treat encoded strings as if they
10771 were encoded in a one-byte-per-character fashion rather than
10772 the actual multi-byte encoding.
10773 @end enumerate
10774
10775 None of the standard non-modal encodings meet all of these
10776 conditions. For example, EUC satisfies only (2) and (3), while
10777 Shift-JIS and Big5 (not yet described) satisfy only (2). (All
10778 non-modal encodings must satisfy (2), in order to be unambiguous.)
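
Properties (1) through (3) follow directly from the split between first bytes
(below 0xA0) and continuation bytes (0xA0 and above). The C sketch below shows
one way to exploit that; the function names are invented for this example and
are not the macros XEmacs actually uses. Each loop runs at most three times
because a character occupies at most four bytes, so the cost is bounded by a
constant.

```c
#include <assert.h>
#include <stddef.h>

/* A byte starts a character if and only if it is below 0xA0. */
int
is_first_byte (unsigned char b)
{
  return b < 0xA0;
}

/* Property (1): index of the first byte of the character that
   contains TEXT[POS]. */
size_t
char_start (const unsigned char *text, size_t pos)
{
  while (pos > 0 && !is_first_byte (text[pos]))
    pos--;
  return pos;
}

/* Property (2): POS is at the beginning of a character; return the
   index where the next character begins (LEN if there is none). */
size_t
next_char (const unsigned char *text, size_t len, size_t pos)
{
  assert (pos < len && is_first_byte (text[pos]));
  do
    pos++;
  while (pos < len && !is_first_byte (text[pos]));
  return pos;
}

/* Property (3): POS is at the beginning of a character other than
   the first; return the index where the previous character begins. */
size_t
prev_char (const unsigned char *text, size_t pos)
{
  assert (pos > 0);
  return char_start (text, pos - 1);
}
```

Note that the private-charset prefixes 0x9E and 0x9F count as first bytes,
while the leading byte that follows them (0xA0 - 0xFF) looks like a
continuation byte, so no special case is needed.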
10779
10780 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings
10781 @subsection Internal Character Encoding
10782 @cindex internal character encoding
10783 @cindex character encoding, internal
10784 @cindex encoding, internal character
10785
10786 One 19-bit word represents a single character. The word is
10787 separated into three fields:
10788
10789 @example
10790 Bit number:   18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
10791               <------------> <------------------> <------------------>
10792 Field:              1                  2                    3
10793 @end example
10794
10795 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits.
10796
10797 @example
10798 Character set            Field 1         Field 2         Field 3
10799 -------------            -------         -------         -------
10800 ASCII                    0               0               PC1
10801    range:                                                (00 - 7F)
10802 Control-1                0               1               PC1
10803    range:                                                (00 - 1F)
10804 Dimension-1 official     0               LB - 0x7F       PC1
10805    range:                                (01 - 0D)       (20 - 7F)
10806 Dimension-1 private      0               LB - 0x80       PC1
10807    range:                                (20 - 6F)       (20 - 7F)
10808 Dimension-2 official     LB - 0x8F       PC1             PC2
10809    range:                (01 - 0A)       (20 - 7F)       (20 - 7F)
10810 Dimension-2 private      LB - 0xE1       PC1             PC2
10811    range:                (0F - 1E)       (20 - 7F)       (20 - 7F)
10812 Composite                0x1F            ?               ?
10813 @end example
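
The field layout can be expressed as a few shifts and masks. The following C
sketch packs and unpacks the three fields of the 19-bit word described above;
the function names are placeholders chosen for this example rather than the
real XEmacs macros.

```c
#include <assert.h>

/* Field 1 is bits 18-14 (5 bits), field 2 is bits 13-07 (7 bits),
   field 3 is bits 06-00 (7 bits). */
unsigned int
make_char_code (unsigned int field1, unsigned int field2,
                unsigned int field3)
{
  assert (field1 < 32 && field2 < 128 && field3 < 128);
  return (field1 << 14) | (field2 << 7) | field3;
}

unsigned int
char_field1 (unsigned int code)
{
  return (code >> 14) & 0x1F;
}

unsigned int
char_field2 (unsigned int code)
{
  return (code >> 7) & 0x7F;
}

unsigned int
char_field3 (unsigned int code)
{
  return code & 0x7F;
}
```

For instance, an ASCII character (fields 0, 0, PC1) packs to its own position
code, and a Control-1 character (fields 0, 1, PC1) packs to 0x80 + PC1.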
10814
10815 Note that character codes 0 - 255 are the same as the ``binary
10816 encoding'' described above.
10817
10818 Most of the code in XEmacs knows nothing of the representation of a
10819 character other than that values 0 - 255 represent ASCII, Control-1,
10820 and Latin-1.
10821
10822 @strong{WARNING WARNING WARNING}: The Boyer-Moore code in
10823 @file{search.c}, and the code in @code{search_buffer()} that determines
10824 whether that code can be used, knows that ``field 3'' in a character
10825 always corresponds to the last byte in the textual representation of the
10826 character. (This is important because the Boyer-Moore algorithm works by
10827 looking at the last byte of the search string and &&#### finish this.
10828
10829 @node Byte/Character Types; Buffer Positions; Other Typedefs, Internal Text API's, Internal Mule Encodings, Multilingual Support
10830 @section Byte/Character Types; Buffer Positions; Other Typedefs
10831 @cindex byte/character types; buffer positions; other typedefs
10832 @cindex byte/character types
10833 @cindex character types
10834 @cindex buffer positions
10835 @cindex typedefs, other
10836
10837 @menu
10838 * Byte Types::
10839 * Different Ways of Seeing Internal Text::
10840 * Buffer Positions::
10841 * Other Typedefs::
10842 * Usage of the Various Representations::
10843 * Working With the Various Representations::
10844 @end menu
10845
10846 @node Byte Types, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs
10847 @subsection Byte Types
10848 @cindex byte types
10849
10850 Stuff pointed to by a char * or unsigned char * will nearly always be
10851 one of the following types:
10852
10853 @itemize @minus
10854 @item
10855 a) [Ibyte] pointer to internally-formatted text
10856 @item
10857 b) [Extbyte] pointer to text in some external format, which can be
10858 defined as all formats other than the internal one
10859 @item
10860 c) [Ascbyte] pure ASCII text
10861 @item
10862 d) [Binbyte] binary data that is not meant to be interpreted as text
10863 @item
10864 e) [Rawbyte] general data in memory, where we don't care about whether
10865 it's text or binary
10866 @item
10867 f) [Boolbyte] a zero or a one
10868 @item
10869 g) [Bitbyte] a byte used for bit fields
10870 @item
10871 h) [Chbyte] null-semantics @code{char *}; used when casting an argument to
10872 an external API where the other types may not be
10873 appropriate
10874 @end itemize
10875
10876 Types (b), (c), (f) and (h) are defined as @code{char}, while the others are
10877 @code{unsigned char}. This is for maximum safety (signed characters are
10878 dangerous to work with) while maintaining as much compatibility with
10879 external API's and string constants as possible.
10880
10881 We also provide versions of the above types defined with different
10882 underlying C types, for API compatibility. These use the following
10883 prefixes:
10884
10885 @example
10886 C = plain char, when the base type is unsigned
10887 U = unsigned
10888 S = signed
10889 @end example
10890
10891 (Formerly I had a comment saying that type (e) "should be replaced with
10892 void *". However, there are in fact many places where an unsigned char
10893 * might be used -- e.g. for ease in pointer computation, since void *
10894 doesn't allow this, and for compatibility with external API's.)
10895
10896 Note that these typedefs are purely for documentation purposes; from
10897 the C code's perspective, they are exactly equivalent to @code{char *},
10898 @code{unsigned char *}, etc., so you can freely use them with library
10899 functions declared as such.
10900
10901 Using these more specific types rather than the general ones helps avoid
10902 the confusions that occur when the semantics of a char * or unsigned
10903 char * argument being studied are unclear. Furthermore, by requiring
10904 that ALL uses of @code{char} be replaced with some other type as part of the
10905 Mule-ization process, we can use a search for @code{char} as a way of finding
10906 code that has not been properly Mule-ized yet.
10907
10908 @node Different Ways of Seeing Internal Text, Buffer Positions, Byte Types, Byte/Character Types; Buffer Positions; Other Typedefs
10909 @subsection Different Ways of Seeing Internal Text
10910 @cindex different ways of seeing internal text
10911
10912 There are various ways of representing internal text. The two primary
10913 ways are as an "array" of individual characters and as a
10914 "stream" of bytes. In the ASCII world, where there are only 256
10915 characters at most, things are easy because each character fits into a
10916 byte. In general, however, this is not true -- see the above discussion
10917 of characters vs. encodings.
10918
10919 In some cases, it's also important to distinguish between a stream
10920 representation as a series of bytes and as a series of textual units.
10921 This is particularly important with respect to Unicode. The UTF-16 representation
10922 (sometimes referred to, rather sloppily, as simply the "Unicode" format)
10923 represents text as a series of 16-bit units. Mostly, each unit
10924 corresponds to a single character, but not necessarily, as characters
10925 outside of the range 0-65535 (the BMP or "Basic Multilingual Plane" of
10926 Unicode) require two 16-bit units, through the mechanism of
10927 "surrogates". When a series of 16-bit units is serialized into a byte
10928 stream, there are at least two possible representations, little-endian
10929 and big-endian, and which one is used may depend on the native format of
10930 16-bit integers in the CPU of the machine that XEmacs is running
10931 on. (Similarly, UTF-32 is logically a representation with 32-bit textual
10932 units.)
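
The distinction between textual units and bytes can be illustrated with a
small C sketch. The first function turns one Unicode code point into one or
two 16-bit units using the standard surrogate arithmetic; the second
serializes a unit in either byte order. Both function names are invented for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-16 textual units.  Returns the number
   of units written to UNITS (1 inside the BMP, 2 otherwise). */
size_t
utf16_encode (uint32_t cp, uint16_t units[2])
{
  /* Surrogate code points themselves are not valid characters. */
  assert (cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000)
    {
      units[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;
  units[0] = (uint16_t) (0xD800 + (cp >> 10));      /* high surrogate */
  units[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));    /* low surrogate */
  return 2;
}

/* Serialize one 16-bit unit into two bytes, big-endian if BIG_ENDIAN
   is nonzero and little-endian otherwise. */
void
utf16_unit_to_bytes (uint16_t unit, int big_endian, unsigned char out[2])
{
  out[0] = (unsigned char) (big_endian ? unit >> 8 : unit & 0xFF);
  out[1] = (unsigned char) (big_endian ? unit & 0xFF : unit >> 8);
}
```

The character count, the unit count and the byte count of the same text can
therefore all differ: one character outside the BMP is two units and four
bytes.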
10933
10934 Specifically:
10935
10936 @itemize @minus
10937 @item
10938 UTF-8 has 1-byte (8-bit) units.
10939 @item
10940 UTF-16 has 2-byte (16-bit) units.
10941 @item
10942 UTF-32 has 4-byte (32-bit) units.
10943 @item
10944 XEmacs-internal encoding (the old "Mule" encoding) has 1-byte (8-bit)
10945 units.
10946 @item
10947 UTF-7 technically has 7-bit units that are within the "mail-safe" range
10948 (ASCII 32 - 126 plus a few control characters), but normally is encoded
10949 in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a
10950 normal mode where printable ASCII characters represent themselves and a
10951 shifted mode, introduced with a plus sign, where a base-64 encoding is
10952 used.)
10953 @item
10954 UTF-5 technically has 7-bit units (normally encoded in an 8-bit stream,
10955 like UTF-7), but only uses uppercase A-V and 0-9, and only encodes 4
10956 bits worth of data per character. UTF-5 is meant for encoding Unicode
10957 inside of DNS names.
10958 @end itemize
10959
10960 Thus, we can imagine three levels in the representation of textual data:
10961
10962 @example
10963 series of characters -> series of textual units -> series of bytes
10964       [Ichar]                   [Itext]                 [Ibyte]
10965 @end example
10966
10967 XEmacs has three corresponding typedefs:
10968
10969 @itemize @minus
10970 @item
10971 An Ichar is an integer (at least 32-bit), representing a 31-bit
10972 character.
10973 @item
10974 An Itext is an unsigned value, either 8, 16 or 32 bits, depending
10975 on the nature of the internal representation, and corresponding to
10976 a single textual unit.
10977 @item
10978 An Ibyte is an @code{unsigned char}, representing a single byte in a
10979 textual byte stream.
10980 @end itemize
10981
10982 Internal text in stream format can be simultaneously viewed as either
10983 @code{Itext *} or @code{Ibyte *}. The @code{Ibyte *} representation is convenient for
10984 copying data from one place to another, because such routines usually
10985 expect byte counts. However, @code{Itext *} is much better for actually
10986 working with the data.
10987
10988 From a text-unit perspective, units 0 through 127 will always be ASCII
10989 compatible, and data in Lisp strings (and other textual data generated
10990 as a whole, e.g. from external conversion) will be followed by a
10991 null-unit terminator. From an @code{Ibyte *} perspective, however, the
10992 encoding is only ASCII-compatible if it uses 1-byte units.
10993
10994 Similarly to the different text representations, three integral count
10995 types exist -- Charcount, Textcount and Bytecount.
10996
10997 NOTE: Despite the presence of the terminator, internal text itself can
10998 have nulls in it! (Null text units, not just the null bytes present in
10999 any UTF-16 encoding.) The terminator is present because in many cases
11000 internal text is passed to routines that will ultimately pass the text
11001 to library functions that cannot handle embedded nulls, e.g. functions
11002 manipulating filenames, and it is a real hassle to have to pass the
11003 length around constantly. But this can lead to sloppy coding! We need
11004 to be careful about watching for nulls in places that are important,
11005 e.g. manipulating string objects or passing data to/from the clipboard.
11006
11007 @table @code
11008 @item Ibyte
11009 The data in a buffer or string is logically made up of Ibyte objects,
11010 where an Ibyte takes up the same amount of space as a char. (It is
11011 declared differently, though, to catch invalid usages.) Strings stored
11012 using Ibytes are said to be in "internal format". The important
11013 characteristics of internal format are
11014
11015 @itemize @minus
11016 @item
11017 ASCII characters are represented as a single Ibyte, in the range 0 -
11018 0x7f.
11019 @item
11020 All other characters are represented as an Ibyte in the range 0x80 - 0x9f
11021 followed by one or more Ibytes in the range 0xa0 to 0xff.
11022 @end itemize
11023
11024 This leads to a number of desirable properties:
11025
11026 @itemize @minus
11027 @item
11028 Given the position of the beginning of a character, you can find the
11029 beginning of the next or previous character in constant time.
11030 @item
11031 When searching for a substring or an ASCII character within the string,
11032 you need merely use standard searching routines.
11033 @end itemize
11034
11035 @item Itext
11036
11037 #### Document me.
11038
11039 @item Ichar
11040 This typedef represents a single Emacs character, which can be ASCII,
11041 ISO-8859, or some extended character, as would typically be used for
11042 Kanji. Note that the representation of a character as an Ichar is @strong{not}
11043 the same as the representation of that same character in a string; thus,
11044 you cannot do the standard C trick of passing a pointer to a character
11045 to a function that expects a string.
11046
11047 An Ichar takes up 19 bits of representation and (for code compatibility
11048 and such) is compatible with an int. This representation is visible on
11049 the Lisp level. The important characteristics of the Ichar
11050 representation are
11051
11052 @itemize @minus
11053 @item
11054 values 0x00 - 0x7f represent ASCII.
11055 @item
11056 values 0x80 - 0xff represent the right half of ISO-8859-1.
11057 @item
11058 values 0x100 and up represent all other characters.
11059 @end itemize
11060
11061 This means that Ichar values are upwardly compatible with the standard
11062 8-bit representation of ASCII/ISO-8859-1.
11063
11064 @item Extbyte
11065 Strings that go in or out of Emacs are in "external format", typedef'ed
11066 as an array of char or a char *. There is more than one external format
11067 (JIS, EUC, etc.) but they have broadly similar properties. Many are modal
11068 encodings, which is to say that the meaning of particular bytes is not
11069 fixed but depends on what "mode" the string is currently in (e.g. bytes
11070 in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or
11071 as 2-byte Kanji, depending on the current mode). The mode starts out in
11072 ASCII/ISO-8859-1 and is switched using escape sequences -- for example,
11073 in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes
11074 in the range 0 - 0x7f are interpreted as Kanji characters.
11075
11076 External-formatted data is generally desirable for passing data between
11077 programs because it is upwardly compatible with standard
11078 ASCII/ISO-8859-1 strings and may require less space than internal
11079 encodings such as the one described above. In addition, some encodings
11080 (e.g. JIS) keep all characters (except the ESC used to switch modes) in
11081 the printing ASCII range 0x20 - 0x7e, which results in a much higher
11082 probability that the data will avoid being garbled in transmission.
11083 Externally-formatted data is generally not very convenient to work with,
11084 however, and for this reason is usually converted to internal format
11085 before any work is done on the string.
11086
11087 NOTE: filenames need to be in external format so that ISO-8859-1
11088 characters come out correctly.
11089 @end table
11090
11091 @node Buffer Positions, Other Typedefs, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs
11092 @subsection Buffer Positions
11093 @cindex buffer positions
11094
11095 There are three possible ways to specify positions in a buffer. All
11096 of these are one-based: the beginning of the buffer is position or
11097 index 1, and 0 is not a valid position.
11098
11099 As a "buffer position" (typedef Charbpos):
11100
11101 This is an index specifying an offset in characters from the
11102 beginning of the buffer. Note that buffer positions are
11103 logically @strong{between} characters, not on a character. The
11104 difference between two buffer positions specifies the number of
11105 characters between those positions. Buffer positions are the
11106 only kind of position externally visible to the user.
11107
11108 As a "byte index" (typedef Bytebpos):
11109
11110 This is an index over the bytes used to represent the characters
11111 in the buffer. If there is no Mule support, this is identical
11112 to a buffer position, because each character is represented
11113 using one byte. However, with Mule support, many characters
11114 require two or more bytes for their representation, and so a
11115 byte index may be greater than the corresponding buffer
11116 position.
11117
11118 As a "memory index" (typedef Membpos):
11119
11120 This is the byte index adjusted for the gap. For positions
11121 before the gap, this is identical to the byte index. For
11122 positions after the gap, this is the byte index plus the gap
11123 size. There are two possible memory indices for the gap
11124 position; the memory index at the beginning of the gap should
11125 always be used, except in code that deals with manipulating the
11126 gap, where both indices may be seen. The address of the
11127 character "at" (i.e. following) a particular position can be
11128 obtained from the formula
11129
11130 buffer_start_address + memory_index(position) - 1
11131
11132 except in the case of characters at the gap position.
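
The relationship between byte indices, memory indices and addresses can be
sketched as follows. The struct and function names here are hypothetical and
only model the description above; the real buffer structure in XEmacs is
different. Positions are one-based, and the gap position uses the
beginning-of-gap convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gap-buffer descriptor, for illustration only. */
struct gap_buffer
{
  unsigned char *start;         /* address of the first byte of storage */
  size_t gap_pos;               /* byte index at which the gap sits */
  size_t gap_size;              /* size of the gap in bytes */
};

/* Convert a one-based byte index to a memory index.  Positions up to
   and including the gap position are unchanged; positions after the
   gap are shifted by the gap size. */
size_t
byte_index_to_memory_index (const struct gap_buffer *buf, size_t byte_index)
{
  assert (byte_index >= 1);
  return byte_index <= buf->gap_pos
    ? byte_index
    : byte_index + buf->gap_size;
}

/* Address of the byte "at" (i.e. following) BYTE_INDEX. */
unsigned char *
byte_address (const struct gap_buffer *buf, size_t byte_index)
{
  size_t mem = byte_index_to_memory_index (buf, byte_index);

  /* The exceptional case: the character at the gap position lives
     just past the gap, so skip over it. */
  if (byte_index == buf->gap_pos)
    mem += buf->gap_size;
  return buf->start + mem - 1;
}
```

With storage holding "ab", a three-byte gap, then "cd", the gap position is 3;
positions 1 and 2 map straight through, position 3 must skip the gap to reach
"c", and position 4 has memory index 7.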
11133
11134 @node Other Typedefs, Usage of the Various Representations, Buffer Positions, Byte/Character Types; Buffer Positions; Other Typedefs
11135 @subsection Other Typedefs
11136 @cindex other typedefs
11137
11138 @table @code
11139 @item Charcount
11140 This typedef represents a count of characters, such as
11141 a character offset into a string or the number of
11142 characters between two positions in a buffer. The
11143 difference between two Charbpos's is a Charcount, and
11144 character positions in a string are represented using
11145 a Charcount.
11146
11147 @item Textcount
11148 #### Document me.
11149
11150 @item Bytecount
11151 Similar to a Charcount but represents a count of bytes.
11152 The difference between two Bytebpos's is a Bytecount.
11153 @end table
11154
11155
11156
11157 @node Usage of the Various Representations, Working With the Various Representations, Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs
11158 @subsection Usage of the Various Representations
11159 @cindex usage of the various representations
11160
11161 Memory indices are used in low-level functions in insdel.c and for
11162 extent endpoints and marker positions. The reason for this is that
11163 this way, the extents and markers don't need to be updated for most
11164 insertions, which merely shrink the gap and don't move any
11165 characters around in memory.
11166
11167 (The beginning-of-gap memory index simplifies insertions w.r.t.
11168 markers, because text usually gets inserted after markers. For
11169 extents, it is merely for consistency, because text can get
11170 inserted either before or after an extent's endpoint depending on
11171 the open/closedness of the endpoint.)
11172
11173 Byte indices are used in other code that needs to be fast,
11174 such as the searching, redisplay, and extent-manipulation code.
11175
11176 Buffer positions are used in all other code. This is because this
11177 representation is easiest to work with (especially since Lisp
11178 code always uses buffer positions), necessitates the fewest
11179 changes to existing code, and is the safest (e.g. if the text gets
11180 shifted underneath a buffer position, it will still point to a
11181 character; if text is shifted under a byte index, it might point
11182 to the middle of a character, which would be bad).
11183
11184 Similarly, Charcounts are used in all code that deals with strings
11185 except for code that needs to be fast, which uses Bytecounts.
11186
11187 Strings are always passed around internally using internal format.
11188 Conversions to and from external format are performed at the time
11189 that the data goes in or out of Emacs.
11190
11191 @node Working With the Various Representations, , Usage of the Various Representations, Byte/Character Types; Buffer Positions; Other Typedefs
11192 @subsection Working With the Various Representations
11193 @cindex working with the various representations
11194
11195 We write things this way because it's very important that
11196 MAX_BYTEBPOS_GAP_SIZE_3 is a multiple of 3. (As it happens,
11197 65535 is a multiple of 3, but this may not always be the
11198 case. #### unfinished
11199
11200 @node Internal Text API's, Coding for Mule, Byte/Character Types; Buffer Positions; Other Typedefs, Multilingual Support
11201 @section Internal Text API's
11202 @cindex internal text API's
11203 @cindex text API's, internal
11204 @cindex API's, text, internal
11205
11206 @strong{NOTE}: The most current documentation for these API's is in
11207 @file{text.h}. In case of error, assume that file is correct and this
11208 one wrong.
11209
11210 @menu
11211 * Basic internal-format API's::
11212 * The DFC API::
11213 * The Eistring API::
11214 @end menu
11215
11216 @node Basic internal-format API's, The DFC API, Internal Text API's, Internal Text API's
11217 @subsection Basic internal-format API's
11218 @cindex basic internal-format API's
11219 @cindex internal-format API's, basic
11220 @cindex API's, basic internal-format
11221
11222 These are simple functions and macros to convert between text
11223 representation and characters, move forward and back in text, etc.
11224
11225 #### Finish the rest of this.
11226
11227 Use the following functions/macros on contiguous text in any of the
11228 internal formats. Those that take a format arg work on all internal
11229 formats; the others work only on the default (variable-width under Mule)
11230 format. If the text you're operating on is known to come from a buffer,
11231 use the buffer-level functions in buffer.h, which automatically know the
11232 correct format and handle the gap.
11233
11234 Some terminology:
11235
11236 "itext" appearing in the macros means "internal-format text" -- type
11237 @code{Ibyte *}. Operations on such pointers themselves, rather than on the
11238 text being pointed to, have "itext" instead of "itext" in the macro
11239 name. "ichar" in the macro names means an Ichar -- the representation
11240 of a character as a single integer rather than a series of bytes, as part
11241 of "itext". Many of the macros below are for converting between the
11242 two representations of characters.
11243
11244 Note also that we try to consistently distinguish between an "Ichar" and
11245 a Lisp character. Stuff working with Lisp characters often just says
11246 "char", so we consistently use "Ichar" when that's what we're working
11247 with.
11248
11249 @node The DFC API, The Eistring API, Basic internal-format API's, Internal Text API's
11250 @subsection The DFC API
11251 @cindex DFC API
11252 @cindex API, DFC
11253
11254 This is for conversion between internal and external text. Note that
11255 there is also the "new DFC" API, which @strong{returns} a pointer to the
11256 converted text (in alloca space), rather than storing it into a
11257 variable.
11258
11259 The macros below are used for converting data between different formats.
11260 Generally, the data is textual, and the formats are related to
11261 internationalization (e.g. converting between internal-format text and
11262 UTF-8) -- but the mechanism is general, and could be used for anything,
11263 e.g. decoding gzipped data.
11264
11265 In general, conversion involves a source of data, a sink, the existing
11266 format of the source data, and the desired format of the sink. The
11267 macros below, however, always require that either the source or sink is
11268 internal-format text. Therefore, in practice the conversions below
11269 involve source, sink, an external format (specified by a coding system),
11270 and the direction of conversion (internal->external or vice-versa).
11271
11272 Sources and sinks can be raw data (sized or unsized -- when unsized,
11273 input data is assumed to be null-terminated [double null-terminated for
11274 Unicode-format data], and on output the length is not stored anywhere),
11275 Lisp strings, Lisp buffers, lstreams, and opaque data objects. When the
11276 output is raw data, the result can be allocated either with @code{alloca()} or
11277 @code{malloc()}. (There is currently no provision for writing into a fixed
11278 buffer. If you want this, use @code{alloca()} output and then copy the data --
11279 but be careful with the size! Unless you are very sure of the encoding
11280 being used, upper bounds for the size are not in general computable.)
11281 The obvious restrictions on source and sink types apply (e.g. Lisp
11282 strings are a source and sink only for internal data).
11283
11284 All raw data outputted will contain an extra null byte (two bytes for
11285 Unicode -- currently, in fact, all output data, whether internal or
11286 external, is double-null-terminated, but you can't count on this; see
11287 below). This means that enough space is allocated to contain the extra
11288 nulls; however, these nulls are not reflected in the returned output
11289 size.
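
The termination rule can be sketched with a stand-in function. This is not the DFC implementation: @code{copy_with_double_null} is a hypothetical name, and the "conversion" is a plain copy. It only illustrates that space for the terminating nulls is allocated while the reported size excludes them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a conversion with a malloc()ed raw-data sink.
   The "conversion" here is just a copy.  Returns NULL if allocation fails;
   otherwise the caller owns the result and must free() it. */
static char *
copy_with_double_null (const char *src, size_t len, size_t *len_out)
{
  char *dst;

  assert (src != NULL);
  dst = malloc (len + 2);             /* room for two terminating zeros */
  if (dst == NULL)
    return NULL;
  memcpy (dst, src, len);
  dst[len] = '\0';
  dst[len + 1] = '\0';
  if (len_out != NULL)
    *len_out = len;                   /* terminators are not counted */
  return dst;
}
```

Because the terminators are always present, the result can be handed to a routine expecting a null-terminated string even when the source was given with an explicit length.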
11290
11291 The most basic macros are TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT.
11292 These can be used to convert between any kinds of sources or sinks.
11293 However, 99% of conversions involve raw data or Lisp strings as both
11294 source and sink, and usually data is output as @code{alloca()} rather than
11295 @code{malloc()}. For this reason, convenience macros are defined for many types
11296 of conversions involving raw data and/or Lisp strings, especially when
11297 the output is an @code{alloca()}ed string. (When the destination is a
11298 Lisp_String, there are other functions that should be used instead --
11299 @code{build_ext_string()} and @code{make_ext_string()}, for example.) The convenience
11300 macros are of two types -- the older kind that store the result into a
11301 specified variable, and the newer kind that return the result. The newer
11302 kind of macros don't exist when the output is sized data, because that
11303 would have two return values. NOTE: All convenience macros are
11304 ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT.
11305 Thus, any comments below about the workings of these macros also apply to
11306 all convenience macros.
11307
11308 @example
11309 TO_EXTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)
11310 TO_INTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)
11311 @end example
11312
11313 Typical use is
11314
11315 @example
11316 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);
11317 @end example
11318
11319 which means that the contents of the lisp string @var{str} are written
11320 to a malloc'ed memory area which will be pointed to by @var{ptr}, after the
11321 function returns. The conversion will be done using the @code{file-name}
11322 coding system (which will be controlled by the user indirectly by
11323 setting or binding the variable @code{file-name-coding-system}).
11324
11325 Some sources and sinks require two C variables to specify. We use
11326 some preprocessor magic to allow different source and sink types, and
11327 even different numbers of arguments to specify different types of
11328 sources and sinks.
11329
11330 So we can have a call that looks like
11331
11332 @example
11333 TO_INTERNAL_FORMAT (DATA, (ptr, len),
11334 MALLOC, (ptr, len),
11335 coding_system);
11336 @end example
11337
11338 The parenthesized argument pairs are required to make the
11339 preprocessor magic work.
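
The kind of preprocessor magic involved can be sketched as below. All names here (@code{struct span}, @code{PAIR_FIRST}, @code{SOURCE_TO_SPAN}, and so on) are hypothetical and chosen for illustration; they are not the macros in @file{text.h}. The sketch shows how a type token selects a handler by token pasting, and how a parenthesized pair is taken apart by placing a function-like macro name directly in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct span { const char *ptr; size_t len; };

static struct span
make_span (const char *ptr, size_t len)
{
  struct span s;
  assert (ptr != NULL);
  s.ptr = ptr;
  s.len = len;
  return s;
}

/* Written in front of a parenthesized pair "(a, b)", these pick one
   element: "PAIR_FIRST (a, b)" expands to "(a)". */
#define PAIR_FIRST(a, b)  (a)
#define PAIR_SECOND(a, b) (b)

/* One handler per source type.  DATA takes a (pointer, length) pair;
   C_STRING takes a single pointer (evaluated twice in this sketch). */
#define SOURCE_DATA_TO_SPAN(val) \
  make_span (PAIR_FIRST val, PAIR_SECOND val)
#define SOURCE_C_STRING_TO_SPAN(val) \
  make_span ((val), strlen (val))

/* Dispatch on the type token by pasting it into a handler name. */
#define SOURCE_TO_SPAN(type, val) SOURCE_##type##_TO_SPAN (val)
```

With these definitions, @code{SOURCE_TO_SPAN (DATA, (buf, 3))} and @code{SOURCE_TO_SPAN (C_STRING, buf)} both compile, even though they pass a different number of C values. Without the parentheses, the pair would be parsed as two separate macro arguments.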
11340
11341 NOTE: GC is inhibited during the entire operation of these macros. This
11342 is because frequently the data to be converted comes from strings but
11343 gets passed in as just DATA, and GC may move around the string data. If
11344 we didn't inhibit GC, there'd have to be a lot of messy recoding,
11345 alloca-copying of strings and other annoying stuff.
11346
11347 The source or sink can be specified in one of these ways:
11348
11349 @example
11350 DATA, (ptr, len), // input data is a fixed buffer of size len
11351 ALLOCA, (ptr, len), // output data is in an @code{ALLOCA()}ed buffer of size len
11352 MALLOC, (ptr, len), // output data is in a @code{malloc()}ed buffer of size len
11353 C_STRING_ALLOCA, ptr, // equivalent to ALLOCA (ptr, len_ignored) on output
11354 C_STRING_MALLOC, ptr, // equivalent to MALLOC (ptr, len_ignored) on output
11355 C_STRING, ptr, // equivalent to DATA, (ptr, strlen/wcslen (ptr))
11356 // on input (the Unicode version is used when correct)
11357 LISP_STRING, string, // input or output is a Lisp_Object of type string
11358 LISP_BUFFER, buffer, // output is written to (point) in lisp buffer
11359 LISP_LSTREAM, lstream, // input or output is a Lisp_Object of type lstream
11360 LISP_OPAQUE, object, // input or output is a Lisp_Object of type opaque
11361 @end example
11362
11363 When specifying the sink, use lvalues, since the macro will assign to them,
11364 except when the sink is an lstream or a lisp buffer.
11365
11366 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the resulting text is
11367 stored in a stack-allocated buffer, which is automatically freed on
11368 returning from the function. However, the sink types @code{MALLOC} and
11369 @code{C_STRING_MALLOC} return @code{xmalloc()}ed memory. The caller is responsible
11370 for freeing this memory using @code{xfree()}.
11371
11372 The macros accept the kinds of sources and sinks appropriate for
11373 internal and external data representation. See the type_checking_assert
11374 macros below for the actual allowed types.
11375
11376 Since some sources and sinks use one argument (a Lisp_Object) to
11377 specify them, while others take a (pointer, length) pair, we use
11378 some C preprocessor trickery to allow pair arguments to be specified
11379 by parenthesizing them, as in the examples above.
11380
11381 Anything prefixed by dfc_ (`data format conversion') is private.
11382 They are only used to implement these macros.
11383
11384 [[Using C_STRING* is appropriate for using with external APIs that
11385 take null-terminated strings. For internal data, we should try to
11386 be '\0'-clean - i.e. allow arbitrary data to contain embedded '\0'.
11387
11388 Sometime in the future we might allow output to C_STRING_ALLOCA or
11389 C_STRING_MALLOC _only_ with @code{TO_EXTERNAL_FORMAT()}, not
11390 @code{TO_INTERNAL_FORMAT()}.]]
11391
11392 The above comments are not true. Frequently (most of the time, in
11393 fact), external strings come as zero-terminated entities, where the
11394 zero-termination is the only way to find out the length. Even in
11395 cases where you can get the length, most of the time the system will
11396 still use the null to signal the end of the string, and there will
11397 still be no way to either send in or receive a string with embedded
11398 nulls. In such situations, it's pointless to track the length
11399 because null bytes can never be in the string. We have a lot of
11400 operations that make it easy to operate on zero-terminated strings,
11401 and forcing the user to deal with the length everywhere would only
11402 make the code uglier and more complicated, for no gain. --ben
11403
11404 There is no problem using the same lvalue for source and sink.
11405
11406 Also, when pointers are required, the code (currently at least) is
11407 lax and allows any pointer types, either in the source or the sink.
11408 This makes it possible, e.g., to deal with internal format data held
11409 in char *'s or external format data held in WCHAR * (i.e. Unicode).
11410
11411 Finally, whenever storage allocation is called for, extra space is
11412 allocated for a terminating zero, and such a zero is stored in the
11413 appropriate place, regardless of whether the source data was
11414 specified using a length or was specified as zero-terminated. This
11415 allows you to freely pass the resulting data, no matter how
11416 obtained, to a routine that expects zero termination (modulo, of
11417 course, that any embedded zeros in the resulting text will cause
11418 truncation). In fact, currently two terminating zeros are allocated
11419 and stored after the data result. This is to allow for the
11420 possibility of storing a Unicode value on output, which needs the
11421 two zeros. Currently, however, the two zeros are stored regardless
11422 of whether the conversion is internal or external and regardless of
11423 whether the external coding system is in fact Unicode. This
11424 behavior may change in the future, and you cannot rely on this --
11425 the most you can rely on is that sink data in Unicode format will
11426 have two terminating nulls, which combine to form one Unicode null
11427 character.
11428
11429 NOTE: You might ask, why are these not written as functions that
11430 @strong{RETURN} the converted string, since that would allow them to be used
11431 much more conveniently, without having to constantly declare temporary
11432 variables? The answer is that in fact I originally did write the
11433 routines that way, but that required either
11434
11435 @itemize @bullet
11436 @item
11437 (a) calling @code{alloca()} inside of a function call, or
11438 @item
11439 (b) using expressions separated by commas and a global temporary variable, or
11440 @item
11441 (c) using the GCC extension (@{ ... @}).
11442 @end itemize
11443
11444 Turned out that all of the above had bugs, all caused by GCC (hence the
11445 comments about "those GCC wankers" and "ream gcc up the ass"). As for
11446 (a), some versions of GCC (especially on Intel platforms) had
11447 buggy implementations of @code{alloca()} that couldn't handle being called
11448 inside of a function call -- they just decremented the stack right in the
11449 middle of pushing args. Oops, crash with stack trashing, very bad. (b)
11450 was an attempt to fix (a), and that led to further GCC crashes, esp. when
11451 you had two such calls in a single subexpression, because GCC couldn't be
11452 counted upon to follow even a minimally reasonable order of execution.
11453 True, you can't count on one argument being evaluated before another, but
11454 GCC would actually interleave them so that the temp var got stomped on by
11455 one while the other was accessing it. So I tried (c), which was
11456 problematic because that GCC extension has more bugs in it than a
11457 termite's nest.
11458
11459 So reluctantly I converted to the current way. Now, that was awhile ago
11460 (c. 1994), and it appears that the bug involving alloca in function calls
11461 has long since been fixed. More recently, I defined the new-dfc routines
11462 down below, which DO allow exactly such convenience of returning your
11463 args rather than store them in temp variables, and I also wrote a
11464 configure check to see whether @code{alloca()} causes crashes inside of function
11465 calls, and if so use the portable @code{alloca()} implementation in alloca.c.
11466 If you define TEST_NEW_DFC, the old routines get written in terms of the
11467 new ones, and I've had a beta put out with this on and it appeared to
11468 cause no problems -- so we should consider switching, and feel no
11469 compunctions about writing further such function-like
11470 @code{alloca()} routines in lieu of statement-like ones. --ben
11471
11472 @node The Eistring API, , The DFC API, Internal Text API's
11473 @subsection The Eistring API
11474 @cindex Eistring API
11475 @cindex API, Eistring
11476
11477 (This API is currently under-used.) When doing simple things with
11478 internal text, the basic internal-format API's are enough. But doing
11479 things like deleting or replacing a substring, concatenating various strings,
11480 etc. is difficult to do cleanly because of the allocation issues.
11481 The Eistring API is designed to deal with this, and provides a clean
11482 way of modifying and building up internal text. (Note that the former
11483 lack of this API has meant that some code uses Lisp strings to do
11484 similar manipulations, resulting in excess garbage and increased
11485 garbage collection.)
11486
11487 NOTE: The Eistring API is (or should be) Mule-correct even without
11488 an ASCII-compatible internal representation.
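
The allocation issues mentioned above can be seen in a miniature string type. This is a sketch only: @code{struct mini_str} and the @code{ms_*} functions are hypothetical, work purely on bytes, and say nothing about the real Eistring layout or its Mule-correctness. It shows the basic idea of a string that knows its own length, grows on demand, and stays null-terminated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the idea; not the real Eistring layout. */
struct mini_str
{
  char *data;     /* always null-terminated */
  size_t len;     /* length in bytes, terminator excluded */
  size_t cap;     /* allocated size of DATA */
};

static void
ms_init (struct mini_str *s)
{
  s->cap = 16;
  s->data = malloc (s->cap);
  if (s->data == NULL)
    abort ();
  s->data[0] = '\0';
  s->len = 0;
}

/* Append the null-terminated string SRC, growing the storage as needed.
   SRC must not point into S's own data. */
static void
ms_cat (struct mini_str *s, const char *src)
{
  size_t n = strlen (src);
  if (s->len + n + 1 > s->cap)
    {
      size_t newcap = 2 * (s->len + n + 1);
      char *p = realloc (s->data, newcap);
      if (p == NULL)
        abort ();
      s->data = p;
      s->cap = newcap;
    }
  memcpy (s->data + s->len, src, n + 1);   /* copies the terminator too */
  s->len += n;
}

/* Delete LEN bytes starting at byte offset OFF. */
static void
ms_del (struct mini_str *s, size_t off, size_t len)
{
  assert (off <= s->len && len <= s->len - off);
  memmove (s->data + off, s->data + off + len, s->len - off - len + 1);
  s->len -= len;
}

static void
ms_free (struct mini_str *s)
{
  free (s->data);
  s->data = NULL;
  s->len = s->cap = 0;
}
```

The caller never computes buffer sizes, which is the property that makes substring deletion and concatenation clean to write.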
11489
11490 @example
11491 #### NOTE: This is a work in progress. Neither the API nor especially
11492 the implementation is finished.
11493
11494 NOTE: An Eistring is a structure that makes it easy to work with
11495 internally-formatted strings of data. It provides operations similar
11496 in feel to the standard @code{strcpy()}, @code{strcat()}, @code{strlen()}, etc., but
11497
11498 (a) it is Mule-correct
11499 (b) it does dynamic allocation so you never have to worry about size
11500 restrictions
11501 (c) it comes in an @code{ALLOCA()} variety (all allocation is stack-local,
11502 so there is no need to explicitly clean up) as well as a @code{malloc()}
11503 variety
11504 (d) it knows its own length, so it does not suffer from standard null
11505 byte brain-damage -- but it null-terminates the data anyway, so
11506 it can be passed to standard routines
11507 (e) it provides a much more powerful set of operations and knows about
11508 all the standard places where string data might reside: Lisp_Objects,
11509 other Eistrings, Ibyte * data with or without an explicit length,
11510 ASCII strings, Ichars, etc.
11511 (f) it provides easy operations to convert to/from externally-formatted
11512 data, and is easier to use than the standard TO_INTERNAL_FORMAT
11513 and TO_EXTERNAL_FORMAT macros. (An Eistring can store both the internal
11514 and external version of its data, but the external version is only
11515 initialized or changed when you call @code{eito_external()}.)
11516
11517 The idea is to make it as easy to write Mule-correct string manipulation
11518 code as it is to write normal string manipulation code. We also make
11519 the API sufficiently general that it can handle multiple internal data
11520 formats (e.g. some fixed-width optimizing formats and a default variable
11521 width format) and allows for @strong{ANY} data format we might choose in the
11522 future for the default format, including UCS2. (In other words, we can't
11523 assume that the internal format is ASCII-compatible and we can't assume
11524 it doesn't have embedded null bytes. We do assume, however, that any
11525 chosen format will have the concept of null-termination.) All of this is
11526 hidden from the user.
11527
11528 #### It is really too bad that we don't have a real object-oriented
11529 language, or at least a language with polymorphism!
11530
11531
11532 **********************************************
11533 * Declaration *
11534 **********************************************
11535
11536 To declare an Eistring, either put one of the following in the local
11537 variable section:
11538
11539 DECLARE_EISTRING (name);
11540 Declare a new Eistring and initialize it to the empty string. This
11541 is a standard local variable declaration and can go anywhere in the
11542 variable declaration section. NAME itself is declared as an
11543 Eistring *, and its storage declared on the stack.
11544
11545 DECLARE_EISTRING_MALLOC (name);
11546 Declare and initialize a new Eistring, which uses @code{malloc()}ed
11547 instead of @code{ALLOCA()}ed data. This is a standard local variable
11548 declaration and can go anywhere in the variable declaration
11549 section. Once you initialize the Eistring, you will have to free
11550 it using @code{eifree()} to avoid memory leaks. You will need to use this
11551 form if you are passing an Eistring to any function that modifies
11552 it (otherwise, the modified data may be in stack space and get
11553 overwritten when the function returns).
11554
11555 or use
11556
11557 Eistring ei;
11558 void eiinit (Eistring *ei);
11559 void eiinit_malloc (Eistring *einame);
11560 If you need to put an Eistring elsewhere than in a local variable
11561 declaration (e.g. in a structure), declare it as shown and then
11562 call one of the init macros.
11563
11564 Also note:
11565
11566 void eifree (Eistring *ei);
11567 If you declared an Eistring to use @code{malloc()} to hold its data,
11568 or converted it to the heap using @code{eito_malloc()}, then this
11569 releases any data in it and afterwards resets the Eistring
11570 using @code{eiinit_malloc()}. Otherwise, it just resets the Eistring
11571 using @code{eiinit()}.
11572
11573
11574 **********************************************
11575 * Conventions *
11576 **********************************************
11577
11578 - The names of the functions have been chosen, where possible, to
11579 match the names of @code{str*()} functions in the standard C API.
11580 -
11581
11582
11583 **********************************************
11584 * Initialization *
11585 **********************************************
11586
11587 void eireset (Eistring *eistr);
11588 Initialize the Eistring to the empty string.
11589
11590 void eicpy_* (Eistring *eistr, ...);
11591 Initialize the Eistring from somewhere:
11592
11593 void eicpy_ei (Eistring *eistr, Eistring *eistr2);
11594 ... from another Eistring.
11595 void eicpy_lstr (Eistring *eistr, Lisp_Object lisp_string);
11596 ... from a Lisp_Object string.
11597 void eicpy_ch (Eistring *eistr, Ichar ch);
11598 ... from an Ichar (this can be a conventional C character).
11599
11600 void eicpy_lstr_off (Eistring *eistr, Lisp_Object lisp_string,
11601 Bytecount off, Charcount charoff,
11602 Bytecount len, Charcount charlen);
11603 ... from a section of a Lisp_Object string.
11604 void eicpy_lbuf (Eistring *eistr, Lisp_Object lisp_buf,
11605 Bytecount off, Charcount charoff,
11606 Bytecount len, Charcount charlen);
11607 ... from a section of a Lisp_Object buffer.
11608 void eicpy_raw (Eistring *eistr, const Ibyte *data, Bytecount len);
11609 ... from raw internal-format data in the default internal format.
11610 void eicpy_rawz (Eistring *eistr, const Ibyte *data);
11611 ... from raw internal-format data in the default internal format
11612 that is "null-terminated" (the meaning of this depends on the nature
11613 of the default internal format).
11614 void eicpy_raw_fmt (Eistring *eistr, const Ibyte *data, Bytecount len,
11615 Internal_Format intfmt, Lisp_Object object);
11616 ... from raw internal-format data in the specified format.
11617 void eicpy_rawz_fmt (Eistring *eistr, const Ibyte *data,
11618 Internal_Format intfmt, Lisp_Object object);
11619 ... from raw internal-format data in the specified format that is
11620 "null-terminated" (the meaning of this depends on the nature of
11621 the specific format).
11622 void eicpy_c (Eistring *eistr, const Ascbyte *c_string);
11623 ... from an ASCII null-terminated string. Non-ASCII characters in
11624 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
11625 void eicpy_c_len (Eistring *eistr, const Ascbyte *c_string, len);
11626 ... from an ASCII string, with length specified. Non-ASCII characters
11627 in the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
11628 void eicpy_ext (Eistring *eistr, const Extbyte *extdata,
11629 Lisp_Object codesys);
11630 ... from external null-terminated data, with coding system specified.
11631 void eicpy_ext_len (Eistring *eistr, const Extbyte *extdata,
11632 Bytecount extlen, Lisp_Object codesys);
11633 ... from external data, with length and coding system specified.
11634 void eicpy_lstream (Eistring *eistr, Lisp_Object lstream);
11635 ... from an lstream; reads data till eof. Data must be in default
11636 internal format; otherwise, interpose a decoding lstream.
11637
11638
11639 **********************************************
11640 * Getting the data out of the Eistring *
11641 **********************************************
11642
11643 Ibyte *eidata (Eistring *eistr);
11644 Return a pointer to the raw data in an Eistring. This is NOT
11645 a copy.
11646
11647 Lisp_Object eimake_string (Eistring *eistr);
11648 Make a Lisp string out of the Eistring.
11649
11650 Lisp_Object eimake_string_off (Eistring *eistr,
11651 Bytecount off, Charcount charoff,
11652 Bytecount len, Charcount charlen);
11653 Make a Lisp string out of a section of the Eistring.
11654
11655 void eicpyout_alloca (Eistring *eistr, LVALUE: Ibyte *ptr_out,
11656 LVALUE: Bytecount len_out);
11657 Make an @code{ALLOCA()} copy of the data in the Eistring, using the
11658 default internal format. Due to the nature of @code{ALLOCA()}, this
11659 must be a macro, with all lvalues passed in as parameters.
11660 (More specifically, not all compilers correctly handle using
11661 @code{ALLOCA()} as the argument to a function call -- GCC on x86
11662 didn't use to, for example.) A pointer to the @code{ALLOCA()}ed data
11663 is stored in PTR_OUT, and the length of the data (not including
11664 the terminating zero) is stored in LEN_OUT.
11665
11666 void eicpyout_alloca_fmt (Eistring *eistr, LVALUE: Ibyte *ptr_out,
11667 LVALUE: Bytecount len_out,
11668 Internal_Format intfmt, Lisp_Object object);
11669 Like @code{eicpyout_alloca()}, but converts to the specified internal
11670 format. (No formats other than FORMAT_DEFAULT are currently
11671 implemented, and you get an assertion failure if you try.)
11672
11673 Ibyte *eicpyout_malloc (Eistring *eistr, Bytecount *intlen_out);
11674 Make a @code{malloc()} copy of the data in the Eistring, using the
11675 default internal format. This is a real function. No lvalues
11676 passed in. Returns the new data, and stores the length (not
11677 including the terminating zero) using INTLEN_OUT, unless it's
11678 a NULL pointer.
11679
11680 Ibyte *eicpyout_malloc_fmt (Eistring *eistr, Internal_Format intfmt,
11681 Bytecount *intlen_out, Lisp_Object object);
11682 Like @code{eicpyout_malloc()}, but converts to the specified internal
11683 format. (No formats other than FORMAT_DEFAULT are currently
11684 implemented, and you get an assertion failure if you try.)
11685
11686
11687 **********************************************
11688 * Moving to the heap *
11689 **********************************************
11690
11691 void eito_malloc (Eistring *eistr);
11692 Move this Eistring to the heap. Its data will be stored in a
11693 @code{malloc()}ed block rather than the stack. Subsequent changes to
11694 this Eistring will @code{realloc()} the block as necessary. Use this
11695 when you want the Eistring to remain in scope past the end of
11696 this function call. You will have to manually free the data
11697 in the Eistring using @code{eifree()}.
11698
11699 void eito_alloca (Eistring *eistr);
11700 Move this Eistring back to the stack, if it was moved to the
11701 heap with @code{eito_malloc()}. This will automatically free any
11702 heap-allocated data.
11703
11704
11705
11706 **********************************************
11707 * Retrieving the length *
11708 **********************************************
11709
11710 Bytecount eilen (Eistring *eistr);
11711 Return the length of the internal data, in bytes. See also
11712 @code{eiextlen()}, below.
11713 Charcount eicharlen (Eistring *eistr);
11714 Return the length of the internal data, in characters.
11715
11716
11717 **********************************************
11718 * Working with positions *
11719 **********************************************
11720
11721 Bytecount eicharpos_to_bytepos (Eistring *eistr, Charcount charpos);
11722 Convert a char offset to a byte offset.
11723 Charcount eibytepos_to_charpos (Eistring *eistr, Bytecount bytepos);
11724 Convert a byte offset to a char offset.
11725 Bytecount eiincpos (Eistring *eistr, Bytecount bytepos);
11726 Increment the given position by one character.
11727 Bytecount eiincpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
11728 Increment the given position by N characters.
11729 Bytecount eidecpos (Eistring *eistr, Bytecount bytepos);
11730 Decrement the given position by one character.
11731 Bytecount eidecpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
11732 Decrement the given position by N characters.
11733
11734
11735 **********************************************
11736 * Getting the character at a position *
11737 **********************************************
11738
11739 Ichar eigetch (Eistring *eistr, Bytecount bytepos);
11740 Return the character at a particular byte offset.
11741 Ichar eigetch_char (Eistring *eistr, Charcount charpos);
11742 Return the character at a particular character offset.
11743
11744
11745 **********************************************
11746 * Setting the character at a position *
11747 **********************************************
11748
11749 Ichar eisetch (Eistring *eistr, Bytecount bytepos, Ichar chr);
11750 Set the character at a particular byte offset.
11751 Ichar eisetch_char (Eistring *eistr, Charcount charpos, Ichar chr);
11752 Set the character at a particular character offset.
11753
11754
11755 **********************************************
11756 * Concatenation *
11757 **********************************************
11758
11759 void eicat_* (Eistring *eistr, ...);
11760 Concatenate onto the end of the Eistring, with data coming from the
11761 same places as above:
11762
11763 void eicat_ei (Eistring *eistr, Eistring *eistr2);
11764 ... from another Eistring.
11765 void eicat_c (Eistring *eistr, Ascbyte *c_string);
11766 ... from an ASCII null-terminated string. Non-ASCII characters in
11767 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
11768 void eicat_raw (ei, const Ibyte *data, Bytecount len);
11769 ... from raw internal-format data in the default internal format.
11770 void eicat_rawz (ei, const Ibyte *data);
11771 ... from raw internal-format data in the default internal format
11772 that is "null-terminated" (the meaning of this depends on the nature
11773 of the default internal format).
11774 void eicat_lstr (ei, Lisp_Object lisp_string);
11775 ... from a Lisp_Object string.
11776 void eicat_ch (ei, Ichar ch);
11777 ... from an Ichar.
11778
11779 (All except the first variety are convenience functions.
11780 In the general case, create another Eistring from the source.)
11781
11782
11783 **********************************************
11784 * Replacement *
11785 **********************************************
11786
11787 void eisub_* (Eistring *eistr, Bytecount off, Charcount charoff,
11788 Bytecount len, Charcount charlen, ...);
11789 Replace a section of the Eistring, specifically:
11790
11791 void eisub_ei (Eistring *eistr, Bytecount off, Charcount charoff,
11792 Bytecount len, Charcount charlen, Eistring *eistr2);
11793 ... with another Eistring.
11794 void eisub_c (Eistring *eistr, Bytecount off, Charcount charoff,
11795 Bytecount len, Charcount charlen, Ascbyte *c_string);
11796 ... with an ASCII null-terminated string. Non-ASCII characters in
11797 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined).
11798 void eisub_ch (Eistring *eistr, Bytecount off, Charcount charoff,
11799 Bytecount len, Charcount charlen, Ichar ch);
11800 ... with an Ichar.
11801
11802 void eidel (Eistring *eistr, Bytecount off, Charcount charoff,
11803 Bytecount len, Charcount charlen);
11804 Delete a section of the Eistring.
11805
11806
11807 **********************************************
11808 * Converting to an external format *
11809 **********************************************
11810
11811 void eito_external (Eistring *eistr, Lisp_Object codesys);
11812 Convert the Eistring to an external format and store the result
11813 in the string. NOTE: Further changes to the Eistring will @strong{NOT}
11814 change the external data stored in the string. You will have to
11815 call @code{eito_external()} again in such a case if you want the external
11816 data.
11817
11818 Extbyte *eiextdata (Eistring *eistr);
11819 Return a pointer to the external data stored in the Eistring as
11820 a result of a prior call to @code{eito_external()}.
11821
11822 Bytecount eiextlen (Eistring *eistr);
11823 Return the length in bytes of the external data stored in the
11824 Eistring as a result of a prior call to @code{eito_external()}.
11825
11826
11827 **********************************************
11828 * Searching in the Eistring for a character *
11829 **********************************************
11830
11831 Bytecount eichr (Eistring *eistr, Ichar chr);
11832 Charcount eichr_char (Eistring *eistr, Ichar chr);
11833 Bytecount eichr_off (Eistring *eistr, Ichar chr, Bytecount off,
11834 Charcount charoff);
11835 Charcount eichr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
11836 Charcount charoff);
11837 Bytecount eirchr (Eistring *eistr, Ichar chr);
11838 Charcount eirchr_char (Eistring *eistr, Ichar chr);
11839 Bytecount eirchr_off (Eistring *eistr, Ichar chr, Bytecount off,
11840 Charcount charoff);
11841 Charcount eirchr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
11842 Charcount charoff);
11843
11844
11845 **********************************************
11846 * Searching in the Eistring for a string *
11847 **********************************************
11848
11849 Bytecount eistr_ei (Eistring *eistr, Eistring *eistr2);
11850 Charcount eistr_ei_char (Eistring *eistr, Eistring *eistr2);
11851 Bytecount eistr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
11852 Charcount charoff);
11853 Charcount eistr_ei_off_char (Eistring *eistr, Eistring *eistr2,
11854 Bytecount off, Charcount charoff);
11855 Bytecount eirstr_ei (Eistring *eistr, Eistring *eistr2);
11856 Charcount eirstr_ei_char (Eistring *eistr, Eistring *eistr2);
11857 Bytecount eirstr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
11858 Charcount charoff);
11859 Charcount eirstr_ei_off_char (Eistring *eistr, Eistring *eistr2,
11860 Bytecount off, Charcount charoff);
11861
11862 Bytecount eistr_c (Eistring *eistr, Ascbyte *c_string);
11863 Charcount eistr_c_char (Eistring *eistr, Ascbyte *c_string);
11864 Bytecount eistr_c_off (Eistring *eistr, Ascbyte *c_string, Bytecount off,
11865 Charcount charoff);
11866 Charcount eistr_c_off_char (Eistring *eistr, Ascbyte *c_string,
11867 Bytecount off, Charcount charoff);
11868 Bytecount eirstr_c (Eistring *eistr, Ascbyte *c_string);
11869 Charcount eirstr_c_char (Eistring *eistr, Ascbyte *c_string);
11870 Bytecount eirstr_c_off (Eistring *eistr, Ascbyte *c_string,
11871 Bytecount off, Charcount charoff);
11872 Charcount eirstr_c_off_char (Eistring *eistr, Ascbyte *c_string,
11873 Bytecount off, Charcount charoff);
11874
11875
11876 **********************************************
11877 * Comparison *
11878 **********************************************
11879
11880 int eicmp_* (Eistring *eistr, ...);
11881 int eicmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
11882 Bytecount len, Charcount charlen, ...);
11883 int eicasecmp_* (Eistring *eistr, ...);
11884 int eicasecmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
11885 Bytecount len, Charcount charlen, ...);
11886 int eicasecmp_i18n_* (Eistring *eistr, ...);
11887 int eicasecmp_i18n_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
11888 Bytecount len, Charcount charlen, ...);
11889
11890 Compare the Eistring with the other data. Return value same as
11891 from strcmp. The `*' is either `ei' for another Eistring (in
11892 which case `...' is an Eistring), or `c' for a pure-ASCII string
11893 (in which case `...' is a pointer to that string). For anything
11894 more complex, first create an Eistring out of the source.
11895 Comparison is either simple (`eicmp_...'), ASCII case-folding
11896 (`eicasecmp_...'), or multilingual case-folding
11897 (`eicasecmp_i18n_...').
11898
11899
11900 More specifically, the prototypes are:
11901
11902 int eicmp_ei (Eistring *eistr, Eistring *eistr2);
11903 int eicmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff,
11904 Bytecount len, Charcount charlen, Eistring *eistr2);
11905 int eicasecmp_ei (Eistring *eistr, Eistring *eistr2);
11906 int eicasecmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff,
11907 Bytecount len, Charcount charlen, Eistring *eistr2);
11908 int eicasecmp_i18n_ei (Eistring *eistr, Eistring *eistr2);
11909 int eicasecmp_i18n_off_ei (Eistring *eistr, Bytecount off,
11910 Charcount charoff, Bytecount len,
11911 Charcount charlen, Eistring *eistr2);
11912
11913 int eicmp_c (Eistring *eistr, Ascbyte *c_string);
11914 int eicmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
11915 Bytecount len, Charcount charlen, Ascbyte *c_string);
11916 int eicasecmp_c (Eistring *eistr, Ascbyte *c_string);
11917 int eicasecmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
11918 Bytecount len, Charcount charlen,
11919 Ascbyte *c_string);
11920 int eicasecmp_i18n_c (Eistring *eistr, Ascbyte *c_string);
11921 int eicasecmp_i18n_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
11922 Bytecount len, Charcount charlen,
11923 Ascbyte *c_string);
11924
11925
11926 **********************************************
11927 * Case-changing the Eistring *
11928 **********************************************
11929
11930 void eilwr (Eistring *eistr);
11931 Convert all characters in the Eistring to lowercase.
11932 void eiupr (Eistring *eistr);
11933 Convert all characters in the Eistring to uppercase.
11934 @end example
11935
11936 @node Coding for Mule, CCL, Internal Text API's, Multilingual Support
11937 @section Coding for Mule
11938 @cindex coding for Mule
11939 @cindex Mule, coding for
11940
11941 Although Mule support is not compiled by default in XEmacs, many people
11942 are using it, and we consider it crucial that new code works correctly
11943 with multibyte characters. This is not hard; it is only a matter of
11944 following several simple coding guidelines. Even if you never
11945 compile with Mule, with a little practice you will find it quite easy
11946 to code Mule-correctly.
11947
11948 Note that these guidelines are not necessarily tied to the current Mule
11949 implementation; they are also a good idea to follow on the grounds of
11950 code generalization for future I18N work.
11951
11952 @menu
11953 * Character-Related Data Types::
11954 * Working With Character and Byte Positions::
11955 * Conversion to and from External Data::
11956 * General Guidelines for Writing Mule-Aware Code::
11957 * An Example of Mule-Aware Code::
11958 * Mule-izing Code::
11959 @end menu
11960
11961 @node Character-Related Data Types, Working With Character and Byte Positions, Coding for Mule, Coding for Mule
11962 @subsection Character-Related Data Types
11963 @cindex character-related data types
11964 @cindex data types, character-related
11965
11966 First, let's review the basic character-related datatypes used by
11967 XEmacs. Note that some of the separate @code{typedef}s are not
11968 mandatory, but they improve clarity of code a great deal, because one
11969 glance at the declaration can tell the intended use of the variable.
11970
11971 @table @code
11972 @item Ichar
11973 @cindex Ichar
11974 An @code{Ichar} holds a single Emacs character.
11975
11976 Obviously, the equality between characters and bytes is lost in the Mule
11977 world. Characters can be represented by one or more bytes in the
11978 buffer, and @code{Ichar} is a C type large enough to hold any
11979 character. (This currently isn't quite true for ISO 10646, which
11980 defines a character as a 31-bit non-negative quantity, while XEmacs
11981 characters are only 30 bits. This is irrelevant, unless you are
11982 considering using the ISO 10646 private groups to support really large
11983 private character sets---in particular, the Mule character set!---in
11984 a version of XEmacs using Unicode internally.)
11985
11986 Without Mule support, an @code{Ichar} is equivalent to an
11987 @code{unsigned char}. [[This doesn't seem to be true; @file{lisp.h}
11988 unconditionally @samp{typedef}s @code{Ichar} to @code{int}.]]
11989
11990 @item Ibyte
11991 @cindex Ibyte
11992 The data representing the text in a buffer or string is logically a set
11993 of @code{Ibyte}s.
11994
11995 XEmacs does not work with the same character formats all the time; when
11996 reading characters from the outside, it decodes them to an internal
11997 format, and likewise encodes them when writing. @code{Ibyte} (in fact
11998 @code{unsigned char}) is the basic unit of the internal format used by
11999 XEmacs buffers and strings. An @code{Ibyte *} is the type that points
12000 at text encoded in the variable-width internal encoding.
12001
12002 One character can correspond to one or more @code{Ibyte}s. In the
12003 current Mule implementation, an ASCII character is represented by a
12004 single @code{Ibyte} with the same value, and every other character by a
12005 sequence of two or more @code{Ibyte}s. (This will also be true of an
12006 implementation using UTF-8 as the internal encoding. In fact, only code
12007 that implements character code conversions and a very few macros used to
12008 implement motion by whole characters will notice the difference between
12009 UTF-8 and the Mule encoding.)
12010
12011 Without Mule support, there are exactly 256 characters, implicitly
12012 Latin-1, and each character is represented using one @code{Ibyte}, and
12013 there is a one-to-one correspondence between @code{Ibyte}s and
12014 @code{Ichar}s.
12015
12016 @item Charxpos
12017 @itemx Charbpos
12018 @itemx Charcount
12019 @cindex Charxpos
12020 @cindex Charbpos
12021 @cindex Charcount
12022 A @code{Charbpos} represents a character position in a buffer. A
12023 @code{Charcount} represents a number (count) of characters. Logically,
12024 subtracting two @code{Charbpos} values yields a @code{Charcount} value.
12025 When representing a character position in a string, we just use
12026 @code{Charcount} directly. The reason for having a separate typedef for
12027 buffer positions is that they are 1-based, whereas string positions are
12028 0-based and hence string counts and positions can be freely intermixed (a
12029 string position is equivalent to the count of characters from the
12030 beginning). When representing a character position that could be either
12031 in a buffer or string (for example, in the extent code), @code{Charxpos}
12032 is used. Although all of these are @code{typedef}ed to
12033 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make
12034 it clear what sort of position is being used.
12035
12036 @code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the
12037 only ones that are ever visible to Lisp.
12038
12039 @item Bytexpos
12040 @itemx Bytecount
12041 @cindex Bytebpos
12042 @cindex Bytecount
12043 A @code{Bytebpos} represents a byte position in a buffer. A
12044 @code{Bytecount} represents the distance between two positions, in
12045 bytes. Byte positions in strings use @code{Bytecount}, and for byte
12046 positions that can be either in a buffer or string, @code{Bytexpos} is
12047 used. The relationship between @code{Bytexpos}, @code{Bytebpos} and
12048 @code{Bytecount} is the same as the relationship between
12049 @code{Charxpos}, @code{Charbpos} and @code{Charcount}.
12050
12051 @item Extbyte
12052 @cindex Extbyte
12053 When dealing with the outside world, XEmacs works with @code{Extbyte}s,
12054 which are equivalent to @code{char}. The distance between two
12055 @code{Extbyte}s is a @code{Bytecount}, since external text is a
12056 byte-by-byte encoding. Extbytes occur mainly at the transition point
12057 between internal text and external functions. XEmacs code should not,
12058 if it can possibly avoid it, do any actual manipulation using external
12059 text, since its format is completely unpredictable (it might not even be
12060 ASCII-compatible).
12061 @end table
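
The difference between byte-oriented and character-oriented types described in the table above can be made concrete with a small, self-contained sketch. It does not use the XEmacs headers. The typedefs are local stand-ins, UTF-8 is assumed as the variable-width internal encoding (the text above notes that such an implementation behaves the same way in this respect), and @code{sketch_count_chars} is a hypothetical helper, not an XEmacs function.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the XEmacs typedefs; the real definitions live in
   the XEmacs headers and may differ in width. */
typedef unsigned char Ibyte;   /* one byte of internal text */
typedef long Bytecount;        /* a length measured in bytes */
typedef long Charcount;        /* a length measured in characters */

/* Hypothetical helper: count the characters in LEN bytes of text at P,
   ASSUMING the text is valid UTF-8.  A byte starts a new character
   unless it has the bit pattern 10xxxxxx (a continuation byte). */
static Charcount
sketch_count_chars (const Ibyte *p, Bytecount len)
{
  Charcount count = 0;
  Bytecount i;

  assert (p != NULL && len >= 0);
  for (i = 0; i < len; i++)
    if ((p[i] & 0xC0) != 0x80)
      count++;
  return count;
}
```

For pure ASCII text the two counts coincide. As soon as one multibyte character is present, the @code{Bytecount} is larger than the @code{Charcount}, which is why the two must never be mixed.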
12062
12063 @node Working With Character and Byte Positions, Conversion to and from External Data, Character-Related Data Types, Coding for Mule
12064 @subsection Working With Character and Byte Positions
12065 @cindex character and byte positions, working with
12066 @cindex byte positions, working with character and
12067 @cindex positions, working with character and byte
12068
12069 Now that we have defined the basic character-related types, we can look
12070 at the macros and functions designed for working with them and for
12071 conversion between them. Most of these macros are defined in
12072 @file{buffer.h}, and we don't discuss all of them here, but only the
12073 most important ones. Examining the existing code is the best way to
12074 learn about them.
12075
12076 @table @code
12077 @item MAX_ICHAR_LEN
12078 @cindex MAX_ICHAR_LEN
12079 This preprocessor constant is the maximum number of bytes needed to
12080 represent an Emacs character in the variable-width internal encoding.
12081 It is useful when allocating temporary strings to hold a known number of
12082 characters. For instance:
12083
12084 @example
12085 @group
12086 @{
12087 Charcount cclen;
12088 ...
12089 @{
12090 /* Allocate space for @var{cclen} characters. */
12091 Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN);
12092 ...
12093 @end group
12094 @end example
12095
12096 If you followed the previous section, you can guess that, logically,
12097 multiplying a @code{Charcount} value with @code{MAX_ICHAR_LEN} produces
12098 a @code{Bytecount} value.
12099
12100 In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4.
12101 Without Mule, it is 1. In a mature Unicode-based XEmacs, it will also
12102 be 4 (since all Unicode characters can be encoded in UTF-8 in 4 bytes or
12103 less), but some versions may use up to 6, in order to use the large
12104 private space provided by ISO 10646 to ``mirror'' the Mule code space.
12105
12106 @item itext_ichar
12107 @itemx set_itext_ichar
12108 @cindex itext_ichar
12109 @cindex set_itext_ichar
12110 The @code{itext_ichar} macro takes an @code{Ibyte} pointer and
12111 returns the @code{Ichar} stored at that position. If it were a
12112 function, its prototype would be:
12113
12114 @example
12115 Ichar itext_ichar (Ibyte *p);
12116 @end example
12117
12118 @code{set_itext_ichar} stores an @code{Ichar} to the specified byte
12119 position. It returns the number of bytes stored:
12120
12121 @example
12122 Bytecount set_itext_ichar (Ibyte *p, Ichar c);
12123 @end example
12124
12125 It is important to note that @code{set_itext_ichar} is safe only for
12126 appending a character at the end of a buffer, not for overwriting a
12127 character in the middle. This is because the width of characters
12128 varies, and @code{set_itext_ichar} cannot resize the string if it
12129 writes, say, a two-byte character where a single-byte character used to
12130 reside.
12131
12132 A typical use of @code{set_itext_ichar} can be demonstrated by this
12133 example, which copies characters from buffer @var{buf} to a temporary
12134 string of Ibytes; @var{p} points to the next free byte of that string.
12135
12136 @example
12137 @group
12138 @{
12139 Charbpos pos;
12140 for (pos = beg; pos < end; pos++)
12141 @{
12142 Ichar c = BUF_FETCH_CHAR (buf, pos);
12143 p += set_itext_ichar (p, c);
12144 @}
12145 @}
12146 @end group
12147 @end example
12148
12149 Note how @code{set_itext_ichar} is used to store the @code{Ichar}
12150 and advance the pointer @var{p} at the same time.
12151
12152 @item INC_IBYTEPTR
12153 @itemx DEC_IBYTEPTR
12154 @cindex INC_IBYTEPTR
12155 @cindex DEC_IBYTEPTR
12156 These two macros increment and decrement an @code{Ibyte} pointer,
12157 respectively. They will adjust the pointer by the appropriate number of
12158 bytes according to the byte length of the character stored there. Both
12159 macros assume that the memory address is located at the beginning of a
12160 valid character.
12161
12162 Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)}
12163 simply expand to @code{p++} and @code{p--}, respectively.
12164
12165 @item bytecount_to_charcount
12166 @cindex bytecount_to_charcount
12167 Given a pointer to a text string and a length in bytes, return the
12168 equivalent length in characters.
12169
12170 @example
12171 Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc);
12172 @end example
12173
12174 @item charcount_to_bytecount
12175 @cindex charcount_to_bytecount
12176 Given a pointer to a text string and a length in characters, return the
12177 equivalent length in bytes.
12178
12179 @example
12180 Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc);
12181 @end example
12182
12183 @item itext_n_addr
12184 @cindex itext_n_addr
12185 Return a pointer to the beginning of the character offset @var{cc} (in
12186 characters) from @var{p}.
12187
12188 @example
12189 Ibyte *itext_n_addr (Ibyte *p, Charcount cc);
12190 @end example
12191 @end table
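
The contracts described in the table above (store a character and return its byte length, fetch the character at a pointer, step over one whole character) can be tried out with the following stand-alone sketch. Everything named @code{sketch_...} is hypothetical, and UTF-8 is assumed in place of the Mule encoding, so this illustrates the calling pattern only and is not the XEmacs implementation.

```c
#include <assert.h>

typedef int Ichar;             /* stand-in: one character */
typedef unsigned char Ibyte;   /* stand-in: one byte of internal text */
typedef long Bytecount;

#define SKETCH_MAX_ICHAR_LEN 4 /* worst case for the assumed UTF-8 */

/* Store C at P and return the number of bytes written, mirroring the
   documented contract of set_itext_ichar. */
static Bytecount
sketch_set_ichar (Ibyte *p, Ichar c)
{
  assert (c >= 0 && c <= 0x10FFFF);
  if (c < 0x80)
    {
      p[0] = (Ibyte) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (Ibyte) (0xC0 | (c >> 6));
      p[1] = (Ibyte) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (Ibyte) (0xE0 | (c >> 12));
      p[1] = (Ibyte) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (Ibyte) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (Ibyte) (0xF0 | (c >> 18));
  p[1] = (Ibyte) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (Ibyte) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (Ibyte) (0x80 | (c & 0x3F));
  return 4;
}

/* Byte length of the character starting at P; this is the distance a
   pointer increment by one character has to cover.  P must point at
   the first byte of a valid character. */
static Bytecount
sketch_ichar_len (const Ibyte *p)
{
  if (*p < 0x80) return 1;
  if (*p < 0xE0) return 2;
  if (*p < 0xF0) return 3;
  return 4;
}

/* Decode the character starting at P (the role played by itext_ichar). */
static Ichar
sketch_get_ichar (const Ibyte *p)
{
  Bytecount len = sketch_ichar_len (p);
  Bytecount i;
  Ichar c;

  if (len == 1)
    return p[0];
  c = p[0] & (0xFF >> (len + 1));   /* payload bits of the lead byte */
  for (i = 1; i < len; i++)
    c = (c << 6) | (p[i] & 0x3F);
  return c;
}

/* Append N characters to BUF, which must have room for
   N * SKETCH_MAX_ICHAR_LEN bytes; return the number of bytes used. */
static Bytecount
sketch_append_all (Ibyte *buf, const Ichar *chars, int n)
{
  Ibyte *p = buf;
  int i;

  for (i = 0; i < n; i++)
    p += sketch_set_ichar (p, chars[i]);   /* store and advance */
  return (Bytecount) (p - buf);
}
```

The buffer is sized by the worst case (a @code{Charcount} times the maximum character length), while the value returned by @code{sketch_append_all} is the @code{Bytecount} actually used.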
12192
12193 @node Conversion to and from External Data, General Guidelines for Writing Mule-Aware Code, Working With Character and Byte Positions, Coding for Mule
12194 @subsection Conversion to and from External Data
12195 @cindex conversion to and from external data
12196 @cindex external data, conversion to and from
12197
12198 When an external function, such as a C library function, returns a
12199 @code{char} pointer, you should almost never treat it as @code{Ibyte *}.
12200 This is because these returned strings may contain 8-bit characters which
12201 can be misinterpreted by XEmacs, and cause a crash. Likewise, when
12202 exporting a piece of internal text to the outside world, you should
12203 always convert it to an appropriate external encoding, lest the internal
12204 stuff (such as the infamous \201 characters) leak out.
12205
12206 The interface for converting between the internal and external
12207 representations of text is the set of conversion macros defined in
12208 @file{buffer.h}. There used to be a fixed set of external formats
12209 supported by these macros, but now any coding system can be used with
12210 them. The coding system alias mechanism is used to create the
12211 following logical coding systems, which replace the fixed external
12212 formats. The @code{dontusethis-set-symbol-value-handler} mechanism was
12213 enhanced to make this possible (more work on that is needed).
12214
12215 Often useful coding systems:
12216
12217 @table @code
12218 @item Qbinary
12219 This is the simplest format and is what we use in the absence of a more
12220 appropriate format. This converts according to the @code{binary} coding
12221 system:
12222
12223 @enumerate a
12224 @item
12225 On input, bytes 0--255 are converted into (implicitly Latin-1)
12226 characters 0--255. A non-Mule XEmacs doesn't really know about
12227 different character sets and the fonts to display them, so the bytes can
12228 be treated as text in different 1-byte encodings by simply setting the
12229 appropriate fonts. So in a sense, non-Mule XEmacs is a multi-lingual
12230 editor if, for example, different fonts are used to display text in
12231 different buffers, faces, or windows. The specifier mechanism gives the
12232 user complete control over this kind of behavior.
12233 @item
12234 On output, characters 0--255 are converted into bytes 0--255 and other
12235 characters are converted into @samp{~}.
12236 @end enumerate
12237
12238 @item Qnative
12239 Format used for the external Unix environment---@code{argv[]}, stuff
12240 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc.
12241 This is encoded according to the encoding specified by the current locale.
12242 [[This is dangerous; current locale is user preference, and the system
12243 is probably going to be something else. Is there anything we can do
12244 about it?]]
12245
12246 @item Qfile_name
12247 Format used for filenames. This is normally the same as @code{Qnative},
12248 but the two should be distinguished for clarity and possible future
12249 separation -- and also because @code{Qfile_name} can be changed using either
12250 the @code{file-name-coding-system} or @code{pathname-coding-system} (now
12251 obsolete) variables.
12252
12253 @item Qctext
12254 Compound-text format. This is the standard X11 format used for data
12255 stored in properties, selections, and the like. This is an 8-bit
12256 no-lock-shift ISO2022 coding system. Unlike @code{Qfile_name}, which is
12257 a user-definable alias, this is a real coding system.
12258
12259 @item Qmswindows_tstr
12260 Used for external data in all MS Windows functions that are declared to
12261 accept data of type @code{LPTSTR} or @code{LPCTSTR}. This maps to either
12262 @code{Qmswindows_multibyte} (a locale-specific encoding, same as
12263 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether
12264 XEmacs is being run under Windows 9X or Windows NT/2000/XP.
12265 @end table
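
The behavior described for @code{Qbinary} above is simple enough to sketch in a few lines. The following is an illustration only, with hypothetical function names and local stand-in typedefs. The real conversion goes through the coding-system machinery rather than through functions like these.

```c
#include <assert.h>
#include <stddef.h>

typedef int Ichar;      /* stand-in: one character */
typedef char Extbyte;   /* stand-in: one byte of external data */

/* Output direction: characters 0-255 become the byte of the same
   value, and every other character becomes '~'. */
static void
sketch_binary_encode (const Ichar *src, size_t n, Extbyte *dst)
{
  size_t i;

  assert (src != NULL && dst != NULL);
  for (i = 0; i < n; i++)
    dst[i] = (src[i] >= 0 && src[i] <= 255) ? (Extbyte) src[i] : '~';
}

/* Input direction: each byte 0-255 becomes the character with the
   same number. */
static void
sketch_binary_decode (const Extbyte *src, size_t n, Ichar *dst)
{
  size_t i;

  assert (src != NULL && dst != NULL);
  for (i = 0; i < n; i++)
    dst[i] = (unsigned char) src[i];
}
```

Note that the output direction loses information: any character above 255 comes back as @samp{~} after a round trip.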
12266
12267 Many other coding systems are provided by default.
12268
12269 There are two fundamental macros to convert between external and
12270 internal format, as well as various convenience macros to simplify the
12271 most common operations.
12272
12273 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and
12274 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments
12275 each of these receives are a source type, a source, a sink type, a sink,
12276 and a coding system (or a symbol naming a coding system).
12277
12278 A typical call looks like
12279 @example
12280 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);
12281 @end example
12282
12283 which means that the contents of the Lisp string @code{str} are written
12284 to a malloc'ed memory area which will be pointed to by @code{ptr} once
12285 the macro has completed. The conversion will be done using the
12286 @code{file-name} coding system, which will be controlled by the user
12287 indirectly by setting or binding the variable
12288 @code{file-name-coding-system}.
12289
12290 Some sources and sinks require two C variables to specify. We use some
12291 preprocessor magic to allow different source and sink types, and even
12292 different numbers of arguments to specify different types of sources and
12293 sinks.
12294
12295 So we can have a call that looks like
12296 @example
12297 TO_INTERNAL_FORMAT (DATA, (ptr, len),
12298 MALLOC, (ptr, len),
12299 coding_system);
12300 @end example
12301
12302 The parenthesized argument pairs are required to make the preprocessor
12303 magic work.
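
To see why a parenthesized pair can travel through a macro as a single argument, consider the following sketch of the general preprocessor technique. It is not the actual XEmacs definition. All @code{SKETCH_...} names are hypothetical, and only the length and pointer of a source are extracted.

```c
#include <assert.h>
#include <string.h>

/* Written directly in front of a parenthesized pair, these pick one
   element: SKETCH_CAR (a, b) expands to (a), SKETCH_CDR (a, b) to (b). */
#define SKETCH_CAR(x, y) (x)
#define SKETCH_CDR(x, y) (y)

/* One accessor per source type.  For DATA, VAL is the pair (ptr, len);
   for C_STRING, VAL is a single pointer. */
#define SKETCH_PTR_DATA(val)      (SKETCH_CAR val)
#define SKETCH_LEN_DATA(val)      ((long) (SKETCH_CDR val))
#define SKETCH_PTR_C_STRING(val)  (val)
#define SKETCH_LEN_C_STRING(val)  ((long) strlen (val))

/* Token pasting selects the accessor from the type name. */
#define SKETCH_SOURCE_PTR(type, val) SKETCH_PTR_##type (val)
#define SKETCH_SOURCE_LEN(type, val) SKETCH_LEN_##type (val)

/* Two small wrappers showing both argument shapes in use. */
static long
sketch_len_of_data (const char *buf, long len)
{
  assert (buf != NULL);
  return SKETCH_SOURCE_LEN (DATA, (buf, len));
}

static long
sketch_len_of_c_string (const char *s)
{
  assert (s != NULL);
  return SKETCH_SOURCE_LEN (C_STRING, s);
}
```

Without the parentheses, the comma in @code{ptr, len} would split the pair into two macro arguments, and the outer macro would see the wrong number of parameters.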
12304
12305 Here are the different source and sink types:
12306
12307 @table @code
12308 @item @code{DATA, (ptr, len),}
12309 input data is a fixed buffer of size @var{len} at address @var{ptr}
12310 @item @code{ALLOCA, (ptr, len),}
12311 output data is placed in an @code{alloca()}ed buffer of size @var{len} pointed to by @var{ptr}
12312 @item @code{MALLOC, (ptr, len),}
12313 output data is in a @code{malloc()}ed buffer of size @var{len} pointed to by @var{ptr}
12314 @item @code{C_STRING_ALLOCA, ptr,}
12315 equivalent to @code{ALLOCA, (ptr, len_ignored)} on output
12316 @item @code{C_STRING_MALLOC, ptr,}
12317 equivalent to @code{MALLOC, (ptr, len_ignored)} on output
12318 @item @code{C_STRING, ptr,}
12319 equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input
12320 @item @code{LISP_STRING, string,}
12321 input or output is a Lisp_Object of type string
12322 @item @code{LISP_BUFFER, buffer,}
12323 output is written to @code{(point)} in lisp buffer @var{buffer}
12324 @item @code{LISP_LSTREAM, lstream,}
12325 input or output is a Lisp_Object of type lstream
12326 @item @code{LISP_OPAQUE, object,}
12327 input or output is a Lisp_Object of type opaque
12328 @end table
12329
12330 A source type of @code{C_STRING} or a sink type of
12331 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where
12332 the external API is not '\0'-byte-clean -- i.e. it expects strings to be
12333 terminated with a null byte. For external API's that are in fact
12334 '\0'-byte-clean, we should of course not use these.
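
The practical consequence of the null-byte distinction above is easy to demonstrate outside XEmacs. In this minimal sketch (the helper name is hypothetical), a block of data with an embedded null byte loses its tail as soon as its length is derived with @code{strlen}, which is what a @code{C_STRING}-style source implies.

```c
#include <assert.h>
#include <string.h>

typedef char Extbyte;    /* stand-in: one byte of external data */
typedef long Bytecount;

/* Length as seen by a null-terminated-string interface: everything up
   to, but not including, the first null byte. */
static Bytecount
sketch_c_string_len (const Extbyte *ptr)
{
  assert (ptr != NULL);
  return (Bytecount) strlen (ptr);
}
```

For the five bytes @code{a b \0 c d}, this helper reports 2, whereas an explicit pointer and length pair, as used by the @code{DATA} source type, describes all 5.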
12335
12336 The sinks to be specified must be lvalues, unless they are the lisp
12337 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}.
12338
12339 There is no problem using the same lvalue for source and sink.
12340
12341 Garbage collection is inhibited during these conversion operations, so
12342 it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}.
12343
12344 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the
12345 resulting text is stored in a stack-allocated buffer, which is
12346 automatically freed on returning from the function. However, the sink
12347 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed
12348 memory. The caller is responsible for freeing this memory using
12349 @code{xfree()}.
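
The ownership rule for the malloc-style sinks can be sketched as follows. This is an illustration under stated assumptions: @code{sketch_copy_malloc} is a hypothetical stand-in that merely copies its input instead of converting it, and it uses the standard @code{malloc()} and @code{free()} rather than the XEmacs wrappers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char Extbyte;   /* stand-in: one byte of external data */

/* Return a freshly allocated, null-terminated copy of LEN bytes at SRC,
   or NULL if allocation fails.  The caller owns the result and must
   release it, just as with a MALLOC or C_STRING_MALLOC sink. */
static Extbyte *
sketch_copy_malloc (const Extbyte *src, size_t len)
{
  Extbyte *out;

  assert (src != NULL);
  out = (Extbyte *) malloc (len + 1);
  if (out == NULL)
    return NULL;
  memcpy (out, src, len);
  out[len] = '\0';
  return out;
}
```

A result like this may safely outlive the function that produced it, which is exactly what a stack-allocated result cannot do. The price is that every call needs a matching release by the caller.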
12350
12351 Note that it doesn't make sense for @code{LISP_STRING} to be a source
12352 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}.
12353 You'll get an assertion failure if you try.
12354
12355 99% of conversions involve raw data or Lisp strings as both source and
12356 sink, and the output usually goes into @code{alloca()}ed, or sometimes
12357 @code{xmalloc()}ed, memory. For this reason, convenience macros are
12358 defined for many types of conversions involving raw data and/or Lisp strings,
12359 especially when the output is an @code{alloca()}ed string. (When the
12360 destination is a Lisp string, there are other functions that should be
12361 used instead -- @code{build_ext_string()} and @code{make_ext_string()},
12362 for example.) The convenience macros are of two types -- the older kind
12363 that store the result into a specified variable, and the newer kind that
12364 return the result. The newer kind of macros don't exist when the output
12365 is sized data, because that would have two return values. NOTE: All
12366 convenience macros are ultimately defined in terms of
12367 @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}. Thus, any
12368 comments above about the workings of these macros also apply to all
12369 convenience macros.
12370
12371 A typical old-style convenience macro is
12372
12373 @example
12374 C_STRING_TO_EXTERNAL (in, out, codesys);
12375 @end example
12376
12377 This is equivalent to
12378
12379 @example
12380 TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys);
12381 @end example
12382
12383 but is easier to write and somewhat clearer, since it clearly identifies
12384 the arguments without the clutter of having the preprocessor types mixed
12385 in.
12386
12387 The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src,
12388 codesys)}, which @emph{returns} the converted data (still in
12389 @code{alloca()} space). This is far more convenient for most
12390 operations.
12391
12392 @node General Guidelines for Writing Mule-Aware Code, An Example of Mule-Aware Code, Conversion to and from External Data, Coding for Mule
12393 @subsection General Guidelines for Writing Mule-Aware Code
12394 @cindex writing Mule-aware code, general guidelines for
12395 @cindex Mule-aware code, general guidelines for writing
12396 @cindex code, general guidelines for writing Mule-aware
12397
12398 This section contains some general guidance on how to write Mule-aware
12399 code, as well as some pitfalls you should avoid.
12400
12401 @table @emph
12402 @item Never use @code{char} and @code{char *}.
12403 In XEmacs, the use of @code{char} and @code{char *} is almost always a
12404 mistake. If you want to manipulate an Emacs character from ``C'', use
12405 @code{Ichar}. If you want to examine a specific octet in the internal
12406 format, use @code{Ibyte}. If you want a Lisp-visible character, use a
12407 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move
12408 through the internal text, use @code{Ibyte *}. Also note that you
12409 almost certainly do not need @code{Ichar *}. Other typedefs to clarify
12410 the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary},
12411 @code{UChar_Binary}, and @code{CIbyte}.
12412
12413 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}.
12414 The whole point of using different types is to avoid confusion about the
12415 use of certain variables. Lest this effect be nullified, you need to be
12416 careful about using the right types.
12417
12418 @item Always convert external data
12419 It is extremely important to always convert external data, because
12420 XEmacs can crash if unexpected 8-bit sequences are copied to its internal
12421 buffers literally.
12422
12423 This means that when a system function, such as @code{readdir}, returns
12424 a string, you normally need to convert it using one of the conversion macros
12425 described in the previous section, before passing it further to Lisp.
12426
12427 Actually, most of the basic system functions that accept '\0'-terminated
12428 string arguments, like @code{stat()} and @code{open()}, have
12429 @strong{encapsulated} equivalents that do the internal to external
12430 conversion themselves. The encapsulated equivalents have a @code{qxe_}
12431 prefix and have string arguments of type @code{Ibyte *}, and you can
12432 pass internally encoded data to them, often from a Lisp string using
12433 @code{XSTRING_DATA}. (A better design might be to provide versions that
12434 accept Lisp strings directly.) [[Really? Then they'd either take
12435 @code{Lisp_Object}s and need to check type, or they'd take
12436 @code{Lisp_String}s, and violate the rules about passing any of the
12437 specific Lisp types.]]
12438
12439 Also note that many internal functions, such as @code{make_string},
12440 accept Ibytes, which removes the need for them to convert the data they
12441 receive. This increases efficiency because that way external data needs
12442 to be decoded only once, when it is read. After that, it is passed
12443 around in internal format.
12444
12445 @item Do all work in internal format
12446 External-formatted data is completely unpredictable in its format. It
12447 may be fixed-width Unicode (not even ASCII compatible); it may be a
12448 modal encoding, in
12449 which case some occurrences of (e.g.) the slash character may be part of
12450 two-byte Asian-language characters, and a naive attempt to split apart a
12451 pathname by slashes will fail; etc. Internal-format text should be
12452 converted to external format only at the point where an external API is
12453 actually called, and the first thing done after receiving
12454 external-format text from an external API should be to convert it to
12455 internal text.
12456 @end table
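
The warning about modal encodings in the last item above can be illustrated with a stand-alone sketch. It assumes ISO-2022-JP-style external text restricted to two escape sequences (@code{ESC $ B} to enter a two-byte character set and @code{ESC ( B} to return to ASCII). Both function names are hypothetical, and this is not how XEmacs itself parses anything.

```c
#include <assert.h>
#include <stddef.h>

/* Naive approach: count every byte equal to '/' without regard to the
   encoding of the text. */
static int
sketch_naive_slash_count (const unsigned char *p, size_t len)
{
  int count = 0;
  size_t i;

  assert (p != NULL);
  for (i = 0; i < len; i++)
    if (p[i] == '/')
      count++;
  return count;
}

/* Mode-aware approach: track the shift state, and only treat a byte as
   a slash while the text is in ASCII mode. */
static int
sketch_modal_slash_count (const unsigned char *p, size_t len)
{
  int two_byte = 0;
  int count = 0;
  size_t i = 0;

  assert (p != NULL);
  while (i < len)
    {
      if (i + 2 < len && p[i] == 0x1B && p[i + 1] == '$' && p[i + 2] == 'B')
        {
          two_byte = 1;          /* ESC $ B: two-byte characters follow */
          i += 3;
        }
      else if (i + 2 < len && p[i] == 0x1B && p[i + 1] == '(' && p[i + 2] == 'B')
        {
          two_byte = 0;          /* ESC ( B: back to ASCII */
          i += 3;
        }
      else if (two_byte)
        i += 2;                  /* skip one two-byte character */
      else
        {
          if (p[i] == '/')
            count++;
          i++;
        }
    }
  return count;
}
```

When a two-byte character happens to contain the byte 0x2F, the naive count is too high, and code that splits a pathname at those positions would cut a character in half. Working on internal-format text avoids having to write a mode-aware scanner for every external encoding.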
12457
12458 @node An Example of Mule-Aware Code, Mule-izing Code, General Guidelines for Writing Mule-Aware Code, Coding for Mule
12459 @subsection An Example of Mule-Aware Code
12460 @cindex code, an example of Mule-aware
12461 @cindex Mule-aware code, an example of
12462
12463 As an example of Mule-aware code, we will analyze the @code{string}
12464 function, which conses up a Lisp string from the character arguments it
12465 receives. Here is the definition, pasted from @file{alloc.c}:
12466
12467 @example
12468 @group
12469 DEFUN ("string", Fstring, 0, MANY, 0, /*
12470 Concatenate all the argument characters and make the result a string.
12471 */
12472 (int nargs, Lisp_Object *args))
12473 @{
12474 Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN);
12475 Ibyte *p = storage;
12476
12477 for (; nargs; nargs--, args++)
12478 @{
12479 Lisp_Object lisp_char = *args;
12480 CHECK_CHAR_COERCE_INT (lisp_char);
12481 p += set_itext_ichar (p, XCHAR (lisp_char));
12482 @}
12483 return make_string (storage, p - storage);
12484 @}
12485 @end group
12486 @end example
12487
12488 Now we can analyze the source line by line.
12489
12490 The resulting string contains exactly as many characters as there are
12491 arguments to the function. This is why we allocate @code{MAX_ICHAR_LEN} *
12492 @var{nargs} bytes on the stack, i.e. the worst-case number of bytes
12493 needed for @var{nargs} @code{Ichar}s to fit in the string.
12494
12495 Then, the loop checks that each element is a character, converting
12496 integers in the process. Like many other functions in XEmacs, this
12497 function silently accepts integers where characters are expected, for
12498 historical and compatibility reasons. In new code that does not need
12499 this leniency, @code{CHECK_CHAR} will suffice. @code{XCHAR (lisp_char)}
12500 extracts the @code{Ichar} from the @code{Lisp_Object}, and
12501 @code{set_itext_ichar} stores it at @code{p}, returning the number of
12502 bytes written, by which @code{p} is then advanced.
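
The allocate-for-the-worst-case and advance-the-pointer pattern described above can be tried outside XEmacs with a minimal sketch. Everything named @code{toy_*} or @code{TOY_*} below is hypothetical, and UTF-8 is used only as a familiar variable-width encoding; it is not the Mule internal format, and @code{malloc} stands in for the stack allocation of the original.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for MAX_ICHAR_LEN: the longest encoding of
   one character in this toy format (UTF-8, NOT the Mule format). */
#define TOY_MAX_CHAR_LEN 4

/* Encode code point C at P and return the number of bytes written,
   mirroring the contract the text describes for set_itext_ichar. */
static size_t
toy_set_text_char (unsigned char *p, unsigned long c)
{
  if (c < 0x80)
    {
      p[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (unsigned char) (0xC0 | (c >> 6));
      p[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (unsigned char) (0xE0 | (c >> 12));
      p[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (unsigned char) (0xF0 | ((c >> 18) & 0x07));
  p[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Concatenate N characters.  The buffer is sized for the worst case
   up front, so the loop never needs a bounds check.  The caller frees
   the result; *LEN_OUT receives the number of bytes actually used. */
static unsigned char *
toy_string (const unsigned long *chars, size_t n, size_t *len_out)
{
  unsigned char *storage = malloc (n * TOY_MAX_CHAR_LEN + 1);
  unsigned char *p = storage;
  size_t i;

  if (!storage)
    return NULL;
  for (i = 0; i < n; i++)
    p += toy_set_text_char (p, chars[i]);
  *len_out = (size_t) (p - storage);
  return storage;
}
```

Note that the byte length returned through @code{len_out} is in general larger than the character count, which is exactly why the two must never be confused in Mule-aware code.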
12503
12504 Other instructive examples of correct coding under Mule can be found all
12505 over the XEmacs code. For starters, I recommend
12506 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have
12507 understood this section of the manual and studied the examples, you can
12508 proceed to write new Mule-aware code.
12509
12510 @node Mule-izing Code, , An Example of Mule-Aware Code, Coding for Mule
12511 @subsection Mule-izing Code
12512
12513 A lot of code is written without Mule in mind, and needs to be made
12514 Mule-correct or ``Mule-ized''. There is really no substitute for
12515 line-by-line analysis when doing this, but the following checklist can
12516 help:
12517
12518 @itemize @bullet
12519 @item
12520 Check all uses of @code{XSTRING_DATA}.
12521 @item
12522 Check all uses of @code{build_string} and @code{make_string}.
12523 @item
12524 Check all uses of @code{tolower} and @code{toupper}.
12525 @item
12526 Check object print methods.
12527 @item
12528 Check for use of functions such as @code{write_c_string},
12529 @code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}.
12530 @item
12531 Check all occurrences of @code{char} and correct to one of the other
12532 typedefs described above.
12533 @item
12534 Check all existing uses of @code{TO_EXTERNAL_FORMAT},
12535 @code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for
12536 @samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}).
12537 @item
12538 In Windows code, string literals may need to be encapsulated with @code{XETEXT}.
12539 @end itemize
12540
12541 @node CCL, Modules for Internationalization, Coding for Mule, Multilingual Support
12542 @section CCL
12543 @cindex CCL
12544
12545 @example
12546 MACHINE CODE:
12547
12548 The machine code consists of a vector of 32-bit words.
12549 The first such word specifies the start of the EOF section of the code;
12550 this is the code executed to handle any stuff that needs to be done
12551 (e.g. designating back to ASCII and left-to-right mode) after all
12552 other encoded/decoded data has been written out. This is not used for
12553 charset CCL programs.
12554
12555 REGISTER: 0..7 -- referred to by RRR or rrr
12556
12557 OPERATOR BIT FIELD (27-bit): XXXXXXXXXXXXXXX RRR TTTTT
12558 TTTTT (5-bit): operator type
12559 RRR (3-bit): register number
12560 XXXXXXXXXXXXXXX (15-bit):
12561 CCCCCCCCCCCCCCC: constant or address
12562 000000000000rrr: register number
12563
12564 AAAAA: 00000 +
12565 00001 -
12566 00010 *
12567 00011 /
12568 00100 %
12569 00101 &
12570 00110 |
12571 00111 ~
12572
12573 01000 <<
12574 01001 >>
12575 01010 <8
12576 01011 >8
12577 01100 //
12578 01101 not used
12579 01110 not used
12580 01111 not used
12581
12582 10000 <
12583 10001 >
12584 10010 ==
12585 10011 <=
12586 10100 >=
12587 10101 !=
12588
12589 OPERATORS: TTTTT RRR XX..
12590
12591 SetCS: 00000 RRR C...C RRR = C...C
12592 SetCL: 00001 RRR ..... RRR = c...c
12593 c.............c
12594 SetR: 00010 RRR ..rrr RRR = rrr
12595 SetA: 00011 RRR ..rrr RRR = array[rrr]
12596 C.............C size of array = C...C
12597 c.............c contents = c...c
12598
12599 Jump: 00100 000 c...c jump to c...c
12600 JumpCond: 00101 RRR c...c if (!RRR) jump to c...c
12601 WriteJump: 00110 RRR c...c Write1 RRR, jump to c...c
12602 WriteReadJump: 00111 RRR c...c Write1, Read1 RRR, jump to c...c
12603 WriteCJump: 01000 000 c...c Write1 C...C, jump to c...c
12604 C...C
12605 WriteCReadJump: 01001 RRR c...c Write1 C...C, Read1 RRR,
12606 C.............C and jump to c...c
12607 WriteSJump: 01010 000 c...c WriteS, jump to c...c
12608 C.............C
12609 S.............S
12610 ...
12611 WriteSReadJump: 01011 RRR c...c WriteS, Read1 RRR, jump to c...c
12612 C.............C
12613 S.............S
12614 ...
12615 WriteAReadJump: 01100 RRR c...c WriteA, Read1 RRR, jump to c...c
12616 C.............C size of array = C...C
12617 c.............c contents = c...c
12618 ...
12619 Branch: 01101 RRR C...C if (RRR >= 0 && RRR < C..)
12620 c.............c branch to (RRR+1)th address
12621 Read1: 01110 RRR ... read 1-byte to RRR
12622 Read2: 01111 RRR ..rrr read 2-byte to RRR and rrr
12623 ReadBranch: 10000 RRR C...C Read1 and Branch
12624 c.............c
12625 ...
12626 Write1: 10001 RRR ..... write 1-byte RRR
12627 Write2: 10010 RRR ..rrr write 2-byte RRR and rrr
12628 WriteC: 10011 000 ..... write 1-char C...CC
12629 C.............C
12630 WriteS: 10100 000 ..... write C..-byte of string
12631 C.............C
12632 S.............S
12633 ...
12634 WriteA: 10101 RRR ..... write array[RRR]
12635 C.............C size of array = C...C
12636 c.............c contents = c...c
12637 ...
12638 End: 10110 000 ..... terminate the execution
12639
12640 SetSelfCS: 10111 RRR C...C RRR AAAAA= C...C
12641 ..........AAAAA
12642 SetSelfCL: 11000 RRR ..... RRR AAAAA= c...c
12643 c.............c
12644 ..........AAAAA
12645 SetSelfR: 11001 RRR ..Rrr RRR AAAAA= rrr
12646 ..........AAAAA
12647 SetExprCL: 11010 RRR ..Rrr RRR = rrr AAAAA c...c
12648 c.............c
12649 ..........AAAAA
12650 SetExprR: 11011 RRR ..rrr RRR = rrr AAAAA Rrr
12651 ............Rrr
12652 ..........AAAAA
12653 JumpCondC: 11100 RRR c...c if !(RRR AAAAA C..) jump to c...c
12654 C.............C
12655 ..........AAAAA
12656 JumpCondR: 11101 RRR c...c if !(RRR AAAAA rrr) jump to c...c
12657 ............rrr
12658 ..........AAAAA
12659 ReadJumpCondC: 11110 RRR c...c Read1 and JumpCondC
12660 C.............C
12661 ..........AAAAA
12662 ReadJumpCondR: 11111 RRR c...c Read1 and JumpCondR
12663 ............rrr
12664 ..........AAAAA
12665 @end example
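
As a reading aid for the layout above, here is a minimal sketch of packing and unpacking one instruction word. It assumes that the fields are packed as written from left to right, with @code{TTTTT} in the five least significant bits, @code{RRR} in the next three, and the constant, address or register operand in the bits above them. The names @code{ccl_fields}, @code{ccl_decode_word} and @code{ccl_encode_word} are hypothetical and are not taken from @file{mule-ccl.c}.

```c
#include <stdint.h>

/* The three fields of one CCL instruction word. */
struct ccl_fields
{
  unsigned op;    /* TTTTT: operator type, 0..31 */
  unsigned reg;   /* RRR:   register number, 0..7 */
  uint32_t arg;   /* X...X: constant, address, or register number */
};

/* Split WORD into its fields (assumed layout: arg | reg | op). */
static struct ccl_fields
ccl_decode_word (uint32_t word)
{
  struct ccl_fields f;

  f.op = word & 0x1F;
  f.reg = (word >> 5) & 0x07;
  f.arg = word >> 8;
  return f;
}

/* Inverse of ccl_decode_word. */
static uint32_t
ccl_encode_word (unsigned op, unsigned reg, uint32_t arg)
{
  return (arg << 8) | ((uint32_t) (reg & 0x07) << 5) | (op & 0x1F);
}
```

Under this assumption, @code{SetCS} (operator @code{00000}) with register 3 and constant 65 is the word @code{(65 << 8) | (3 << 5)}, and decoding it yields the same three values back.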
12666
12667 @node Modules for Internationalization, , CCL, Multilingual Support
12668 @section Modules for Internationalization
12669 @cindex modules for internationalization
12670 @cindex internationalization, modules for
12671
12672 @example
12673 @file{mule-canna.c}
12674 @file{mule-ccl.c}
12675 @file{mule-charset.c}
12676 @file{mule-charset.h}
12677 @file{file-coding.c}
12678 @file{file-coding.h}
12679 @file{mule-coding.c}
12680 @file{mule-mcpath.c}
12681 @file{mule-mcpath.h}
12682 @file{mule-wnnfns.c}
12683 @file{mule.c}
12684 @end example
12685
12686 These files implement the MULE (Asian-language) support. Note that MULE
12687 actually provides a general interface for all sorts of languages, not
12688 just Asian languages (although they are generally the most complicated
12689 to support). This code is still in beta.
12690
12691 @file{mule-charset.*} and @file{file-coding.*} provide the heart of the
12692 XEmacs MULE support. @file{mule-charset.*} implements the @dfn{charset}
12693 Lisp object type, which encapsulates a character set (an ordered one- or
12694 two-dimensional set of characters, such as US ASCII or JISX0208 Japanese
12695 Kanji).
12696
12697 @file{file-coding.*} implements the @dfn{coding-system} Lisp object
12698 type, which encapsulates a method of converting between different
12699 encodings. An encoding is a representation of a stream of characters,
12700 possibly from multiple character sets, using a stream of bytes or words,
12701 and defines (e.g.) which escape sequences are used to specify particular
12702 character sets, how the indices for a character are converted into bytes
12703 (sometimes this involves setting the high bit; sometimes complicated
12704 rearranging of the values takes place, as in the Shift-JIS encoding),
12705 etc. It also contains some generic coding system implementations, such
12706 as the binary (no-conversion) coding system and a sample gzip coding system.
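
To make the two kinds of index-to-byte conversion mentioned above concrete, here is a minimal sketch for a single JIS X 0208 character whose two index bytes each lie in the range 0x21..0x7E. The first function sets the high bit, as EUC-JP does; the second performs the Shift-JIS rearrangement. The function names are hypothetical, and this is the commonly published conversion arithmetic rather than a copy of the XEmacs implementation.

```c
#include <stdint.h>

/* EUC-JP style: simply set the high bit of both index bytes. */
static void
jis_to_euc (uint8_t j1, uint8_t j2, uint8_t *e1, uint8_t *e2)
{
  *e1 = (uint8_t) (j1 | 0x80);
  *e2 = (uint8_t) (j2 | 0x80);
}

/* Shift-JIS style: two rows of JIS X 0208 share one lead byte, and
   the trail byte tells the two rows apart. */
static void
jis_to_sjis (uint8_t j1, uint8_t j2, uint8_t *s1, uint8_t *s2)
{
  unsigned lead = ((unsigned) (j1 - 0x21) >> 1) + 0x81;
  unsigned trail;

  if (lead > 0x9F)
    lead += 0x40;               /* skip the 0xA0..0xDF range */
  if (j1 & 1)
    {
      trail = j2 + 0x1Fu;       /* odd row: trail bytes 0x40..0x9E */
      if (trail >= 0x7F)
        trail++;                /* 0x7F is never used as a trail byte */
    }
  else
    trail = j2 + 0x7Eu;         /* even row: trail bytes 0x9F..0xFC */
  *s1 = (uint8_t) lead;
  *s2 = (uint8_t) trail;
}
```

For example, the index pair 0x30 0x21 becomes 0xB0 0xA1 with the high bit set and 0x88 0x9F after the Shift-JIS rearrangement.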
12707
12708 @file{mule-coding.c} contains the implementations of text coding systems.
12709
12710 @file{mule-ccl.c} provides the CCL (Code Conversion Language)
12711 interpreter. CCL is similar in spirit to Lisp byte code and is used to
12712 implement converters for custom encodings.
12713
12714 @file{mule-canna.c} and @file{mule-wnnfns.c} implement interfaces to
12715 external programs used to implement the Canna and WNN input methods,
12716 respectively. Both are currently in beta.
12717
12718 @file{mule-mcpath.c} provides some functions to allow for pathnames
12719 containing extended characters. This code is fragmentary, obsolete, and
12720 completely non-working. Instead, @code{pathname-coding-system} is used
12721 to specify conversions of names of files and directories. The standard
12722 C I/O functions like @samp{open()} are wrapped so that conversion occurs
12723 automatically.
12724
12725 @file{mule.c} contains a few miscellaneous things. It currently seems
12726 to be unused and probably should be removed.
12727
12728
12729
12730 @example
12731 @file{intl.c}
12732 @end example
12733
12734 This provides some miscellaneous internationalization code for
12735 implementing message translation and interfacing to the Ximp input
12736 method. None of this code is currently working.
12737
12738
12739
12740 @example
12741 @file{iso-wide.h}
12742 @end example
12743
12744 This contains leftover code from an earlier implementation of
12745 Asian-language support, and is not currently used.
12746
12747
12748 @node The Lisp Reader and Compiler, Lstreams, Multilingual Support, Top
12749 @chapter The Lisp Reader and Compiler
12750 @cindex Lisp reader and compiler, the
12751 @cindex reader and compiler, the Lisp
12752 @cindex compiler, the Lisp reader and
12753
12754 Not yet documented.
12755
12756 @node Lstreams, Consoles; Devices; Frames; Windows, The Lisp Reader and Compiler, Top
12757 @chapter Lstreams
12758 @cindex lstreams
12759
12760 An @dfn{lstream} is an internal Lisp object that provides a generic
12761 buffering stream implementation. Conceptually, you send data to the
12981 @deftypefn {Lstream Method} Lisp_Object marker (Lisp_Object @var{lstream}, void (*@var{markfun}) (Lisp_Object))
12982 Mark this object for garbage collection. Same semantics as a standard
12983 @code{Lisp_Object} marker. This function can be @code{NULL}.
12984 @end deftypefn
12985
12986 @node Consoles; Devices; Frames; Windows, The Redisplay Mechanism, Lstreams, Top 15075 @node Subprocesses, Interface to MS Windows, Lstreams, Top
12987 @chapter Consoles; Devices; Frames; Windows
12988 @cindex consoles; devices; frames; windows
12989 @cindex devices; frames; windows, consoles;
12990 @cindex frames; windows, consoles; devices;
12991 @cindex windows, consoles; devices; frames;
12992
12993 @menu
12994 * Introduction to Consoles; Devices; Frames; Windows::
12995 * Point::
12996 * Window Hierarchy::
12997 * The Window Object::
12998 * Modules for the Basic Displayable Lisp Objects::
12999 @end menu
13000
13001 @node Introduction to Consoles; Devices; Frames; Windows, Point, Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows
13002 @section Introduction to Consoles; Devices; Frames; Windows
13003 @cindex consoles; devices; frames; windows, introduction to
13004 @cindex devices; frames; windows, introduction to consoles;
13005 @cindex frames; windows, introduction to consoles; devices;
13006 @cindex windows, introduction to consoles; devices; frames;
13007
13008 A window-system window that you see on the screen is called a
13009 @dfn{frame} in Emacs terminology. Each frame is subdivided into one or
13010 more non-overlapping panes, called (confusingly) @dfn{windows}. Each
13011 window displays the text of a buffer in it. (See above on Buffers.) Note
13012 that buffers and windows are independent entities: Two or more windows
13013 can be displaying the same buffer (potentially in different locations),
13014 and a buffer can be displayed in no windows.
13015
13016 A single display screen that contains one or more frames is called
13017 a @dfn{display}. Under most circumstances, there is only one display.
13018 However, more than one display can exist, for example if you have
13019 a @dfn{multi-headed} console, i.e. one with a single keyboard but
13020 multiple displays. (Typically in such a situation, the various
13021 displays act like one large display, in that the mouse is only
13022 in one of them at a time, and moving the mouse off of one moves
13023 it into another.) In some cases, the different displays will
13024 have different characteristics, e.g. one color and one mono.
13025
13026 XEmacs can display frames on multiple displays. It can even deal
13027 simultaneously with frames on multiple keyboards (called @dfn{consoles} in
13028 XEmacs terminology). Here is one case where this might be useful: You
13029 are using XEmacs on your workstation at work, and leave it running.
13030 Then you go home and dial in on a TTY line, and you can use the
13031 already-running XEmacs process to display another frame on your local
13032 TTY.
13033
13034 Thus, there is a hierarchy console -> display -> frame -> window.
13035 There is a separate Lisp object type for each of these four concepts.
13036 Furthermore, there is logically a @dfn{selected console},
13037 @dfn{selected display}, @dfn{selected frame}, and @dfn{selected window}.
13038 Each of these objects is distinguished in various ways, such as being the
13039 default object for various functions that act on objects of that type.
13040 Note that every containing object remembers the ``selected'' object
13041 among the objects that it contains: e.g. not only is there a selected
13042 window, but every frame remembers the last window in it that was
13043 selected, and changing the selected frame causes the remembered window
13044 within it to become the selected window. Similar relationships apply
13045 for consoles to devices (the objects representing displays) and devices to frames.
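
The way each container remembers its own selected object can be sketched as follows. The @code{hier_*} types and functions are hypothetical and reduced to the bare pointers; the real objects are Lisp objects with many more fields.

```c
#include <stddef.h>

/* Hypothetical, stripped-down containment hierarchy. */
struct hier_window { int id; };
struct hier_frame { struct hier_window *selected_window; };
struct hier_device { struct hier_frame *selected_frame; };
struct hier_console { struct hier_device *selected_device; };

static struct hier_console *hier_selected_console;

/* The globally selected window is found by following the chain of
   remembered selections from the selected console downwards. */
static struct hier_window *
hier_selected_window (void)
{
  struct hier_console *c = hier_selected_console;

  if (!c || !c->selected_device || !c->selected_device->selected_frame)
    return NULL;
  return c->selected_device->selected_frame->selected_window;
}

/* Selecting a frame only redirects the device's pointer.  The window
   that the frame remembered thereby becomes the selected window. */
static void
hier_select_frame (struct hier_device *d, struct hier_frame *f)
{
  d->selected_frame = f;
}
```

With this arrangement, switching to another frame and back again restores the window that was last selected in the first frame without any extra bookkeeping.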
13046
13047 @node Point, Window Hierarchy, Introduction to Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows
13048 @section Point
13049 @cindex point
13050
13051 Recall that every buffer has a current insertion position, called
13052 @dfn{point}. Now, two or more windows may be displaying the same buffer,
13053 and the text cursor in the two windows (i.e. @code{point}) can be in
13054 two different places. You may ask, how can that be, since each
13055 buffer has only one value of @code{point}? The answer is that each window
13056 also has a value of @code{point} that is squirreled away in it. There
13057 is only one selected window, and the buffer's value of @code{point}
13058 corresponds to that window. When the selected window is changed
13059 from one window to another displaying the same buffer, the buffer's
13060 value of @code{point} is stored into the old window's saved point, and
13061 the saved point of the new window is retrieved and made the
13062 value of @code{point} in the buffer. This means that the point stored
13063 in the selected window is potentially out of date, and if you
13064 want to retrieve the correct value of @code{point} for a window,
13065 you must special-case on the selected window and retrieve the
13066 buffer's point instead. This is related to why @code{save-window-excursion}
13067 does not save the selected window's value of @code{point}.
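
The swap described above, and the special case it forces on anyone reading a window's point, can be sketched like this. The @code{pt_*} names are hypothetical; the field name @code{pointm} follows the window field of the same name, but here it is a plain integer rather than a marker.

```c
#include <stddef.h>

struct pt_buffer { long point; };

struct pt_window
{
  struct pt_buffer *buffer;
  long pointm;                  /* point stashed while not selected */
};

static struct pt_window *pt_selected;

/* The correct point for W: the buffer's point if W is selected,
   otherwise the value stashed in the window. */
static long
pt_window_point (const struct pt_window *w)
{
  return w == pt_selected ? w->buffer->point : w->pointm;
}

/* Save the buffer's point into the outgoing window, then load the
   incoming window's stashed point into its buffer. */
static void
pt_select_window (struct pt_window *w)
{
  if (pt_selected)
    pt_selected->pointm = pt_selected->buffer->point;
  pt_selected = w;
  w->buffer->point = w->pointm;
}
```

After the cursor moves in the selected window, its @code{pointm} is stale until another window is selected, which is why @code{pt_window_point} must not read it directly in that case.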
13068
13069 @node Window Hierarchy, The Window Object, Point, Consoles; Devices; Frames; Windows
13070 @section Window Hierarchy
13071 @cindex window hierarchy
13072 @cindex hierarchy of windows
13073
13074 If a frame contains multiple windows (panes), they are always created
13075 by splitting an existing window along the horizontal or vertical axis.
13076 Terminology is a bit confusing here: to @dfn{split a window
13077 horizontally} means to create two side-by-side windows, i.e. to make a
13078 @emph{vertical} cut in a window. Likewise, to @dfn{split a window
13079 vertically} means to create two windows, one above the other, by making
13080 a @emph{horizontal} cut.
13081
13082 If you split a window and then split again along the same axis, you
13083 will end up with a number of panes all arranged along the same axis.
13084 The precise way in which the splits were made should not be important,
13085 and this is reflected internally. Internally, all windows are arranged
13086 in a tree, consisting of two types of windows, @dfn{combination} windows
13087 (which have children, and are covered completely by those children) and
13088 @dfn{leaf} windows, which have no children and are visible. Every
13089 combination window has two or more children, all arranged along the same
13090 axis. There are (logically) two subtypes of combination windows, depending on
13091 whether their children are horizontally or vertically arrayed. There is
13092 always one root window, which is either a leaf window (if the frame
13093 contains only one window) or a combination window (if the frame contains
13094 more than one window). In the latter case, the root window will have
13095 two or more children, either horizontally or vertically arrayed, and
13096 each of those children will be either a leaf window or another
13097 combination window.
13098
13099 Here are some rules:
13100
13101 @enumerate
13102 @item
13103 Horizontal combination windows can never have children that are
13104 horizontal combination windows; same for vertical.
13105
13106 @item
13107 Only leaf windows can be split (obviously) and this splitting does one
13108 of two things: (a) turns the leaf window into a combination window and
13109 creates two new leaf children, or (b) turns the leaf window into one of
13110 the two new leaves and creates the other leaf. Rule (1) dictates which
13111 of these two outcomes happens.
13112
13113 @item
13114 Every combination window must have at least two children.
13115
13116 @item
13117 Leaf windows can never become combination windows. They can be deleted,
13118 however. If this results in a violation of (3), the parent combination
13119 window also gets deleted.
13120
13121 @item
13122 All functions that accept windows must be prepared to accept combination
13123 windows, and do something sane (e.g. signal an error when given one).
13124 Combination windows @emph{do} escape to the Lisp level.
13125
13126 @item
13127 All windows have three fields governing their contents:
13128 these are @dfn{hchild} (a list of horizontally-arrayed children),
13129 @dfn{vchild} (a list of vertically-arrayed children), and @dfn{buffer}
13130 (the buffer contained in a leaf window). Exactly one of
13131 these will be non-@code{nil}. Remember that @dfn{horizontally-arrayed}
13132 means ``side-by-side'' and @dfn{vertically-arrayed} means
13133 ``one above the other''.
13134
13135 @item
13136 Leaf windows also have markers in their @code{start} (the
13137 first buffer position displayed in the window) and @code{pointm}
13138 (the window's stashed value of @code{point}---see above) fields,
13139 while combination windows have @code{nil} in these fields.
13140
13141 @item
13142 The list of children for a window is threaded through the
13143 @code{next} and @code{prev} fields of each child window.
13144
13145 @item
13146 @strong{Deleted windows can be undeleted}. This happens as a result of
13147 restoring a window configuration, and is unlike frames, displays, and
13148 consoles, which, once deleted, can never be restored. Deleting a window
13149 does nothing except set a special @code{dead} bit to 1 and clear out the
13150 @code{next}, @code{prev}, @code{hchild}, and @code{vchild} fields, for
13151 GC purposes.
13152
13153 @item
13154 Most frames actually have two top-level windows---one for the
13155 minibuffer and one (the @dfn{root}) for everything else. The modeline
13156 (if present) separates these two. The @code{next} field of the root
13157 points to the minibuffer, and the @code{prev} field of the minibuffer
13158 points to the root. The other @code{next} and @code{prev} fields are
13159 @code{nil}, and the frame points to both of these windows.
13160 Minibuffer-less frames have no minibuffer window, and the @code{next}
13161 and @code{prev} of the root window are @code{nil}. Minibuffer-only
13162 frames have no root window, and the @code{next} of the minibuffer window
13163 is @code{nil} but the @code{prev} points to itself. (#### This is an
13164 artifact that should be fixed.)
13165 @end enumerate
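
Several of the rules above are structural invariants that can be checked mechanically. The following sketch checks rules (1), (3) and (6) on a toy tree and counts the visible leaves. The @code{wt_*} names are hypothetical, and it is an assumption of this sketch that @code{hchild} and @code{vchild} point at the first child, with the remaining children reached through @code{next}.

```c
#include <stddef.h>

struct wt_window
{
  struct wt_window *hchild;     /* first of side-by-side children */
  struct wt_window *vchild;     /* first of stacked children */
  void *buffer;                 /* non-NULL only in leaf windows */
  struct wt_window *next, *prev;
};

/* Return 1 if the tree rooted at W obeys rules (1), (3) and (6). */
static int
wt_valid (const struct wt_window *w)
{
  const struct wt_window *c;
  int contents = (w->hchild != NULL) + (w->vchild != NULL)
    + (w->buffer != NULL);
  int children = 0;

  if (contents != 1)            /* rule (6): exactly one is set */
    return 0;
  if (w->buffer)
    return 1;                   /* a leaf has nothing more to check */
  for (c = w->hchild ? w->hchild : w->vchild; c; c = c->next)
    {
      children++;
      /* rule (1): no child may be split along the parent's axis */
      if ((w->hchild && c->hchild) || (w->vchild && c->vchild))
        return 0;
      if (!wt_valid (c))
        return 0;
    }
  return children >= 2;         /* rule (3) */
}

/* Count the leaf windows, i.e. the panes that are actually visible. */
static int
wt_count_leaves (const struct wt_window *w)
{
  const struct wt_window *c;
  int n = 0;

  if (w->buffer)
    return 1;
  for (c = w->hchild ? w->hchild : w->vchild; c; c = c->next)
    n += wt_count_leaves (c);
  return n;
}
```

A frame split side by side, with the right-hand pane then split into an upper and a lower pane, gives a horizontal root whose second child is a vertical combination window: three leaves in total.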
13166
13167 @node The Window Object, Modules for the Basic Displayable Lisp Objects, Window Hierarchy, Consoles; Devices; Frames; Windows
13168 @section The Window Object
13169 @cindex window object, the
13170 @cindex object, the window
13171
13172 Windows have the following accessible fields:
13173
13174 @table @code
13175 @item frame
13176 The frame that this window is on.
13177
13178 @item mini_p
13179 Non-@code{nil} if this window is a minibuffer window.
13180
13181 @item buffer
13182 The buffer that the window is displaying. This may change often during
13183 the life of the window.
13184
13185 @item dedicated
13186 Non-@code{nil} if this window is dedicated to its buffer.
13187
13188 @item pointm
13189 @cindex window point internals
13190 This is the value of point in the current buffer when this window is
13191 selected; when it is not selected, it retains its previous value.
13192
13193 @item start
13194 The position in the buffer that is the first character to be displayed
13195 in the window.
13196
13197 @item force_start
13198 If this flag is non-@code{nil}, it says that the window has been
13199 scrolled explicitly by the Lisp program. This affects what the next
13200 redisplay does if point is off the screen: instead of scrolling the
13201 window to show the text around point, it moves point to a location that
13202 is on the screen.
13203
13204 @item last_modified
13205 The @code{modified} field of the window's buffer, as of the last time
13206 a redisplay completed in this window.
13207
13208 @item last_point
13209 The buffer's value of point, as of the last time
13210 a redisplay completed in this window.
13211
13212 @item left
13213 This is the left-hand edge of the window, measured in columns. (The
13214 leftmost column on the screen is @w{column 0}.)
13215
13216 @item top
13217 This is the top edge of the window, measured in lines. (The top line on
13218 the screen is @w{line 0}.)
13219
13220 @item height
13221 The height of the window, measured in lines.
13222
13223 @item width
13224 The width of the window, measured in columns.
13225
13226 @item next
13227 This is the window that is the next in the chain of siblings. It is
13228 @code{nil} in a window that is the rightmost or bottommost of a group of
13229 siblings.
13230
13231 @item prev
13232 This is the window that is the previous in the chain of siblings. It is
13233 @code{nil} in a window that is the leftmost or topmost of a group of
13234 siblings.
13235
13236 @item parent
13237 Internally, XEmacs arranges windows in a tree; each group of siblings has
13238 a parent window whose area includes all the siblings. This field points
13239 to a window's parent.
13240
13241 Parent windows do not display buffers, and play little role in display
13242 except to shape their child windows. Emacs Lisp programs usually have
13243 no access to the parent windows; they operate on the windows at the
13244 leaves of the tree, which actually display buffers.
13245
13246 @item hscroll
13247 This is the number of columns that the display in the window is scrolled
13248 horizontally to the left. Normally, this is 0.
13249
13250 @item use_time
13251 This is the last time that the window was selected. The function
13252 @code{get-lru-window} uses this field.
13253
13254 @item display_table
13255 The window's display table, or @code{nil} if none is specified for it.
13256
13257 @item update_mode_line
13258 Non-@code{nil} means this window's mode line needs to be updated.
13259
13260 @item base_line_number
13261 The line number of a certain position in the buffer, or @code{nil}.
13262 This is used for displaying the line number of point in the mode line.
13263
13264 @item base_line_pos
13265 The position in the buffer for which the line number is known, or
13266 @code{nil} meaning none is known.
13267
13268 @item region_showing
13269 If the region (or part of it) is highlighted in this window, this field
13270 holds the mark position that made one end of that region. Otherwise,
13271 this field is @code{nil}.
13272 @end table
13273
13274 @node Modules for the Basic Displayable Lisp Objects, , The Window Object, Consoles; Devices; Frames; Windows
13275 @section Modules for the Basic Displayable Lisp Objects
13276 @cindex modules for the basic displayable Lisp objects
13277 @cindex displayable Lisp objects, modules for the basic
13278 @cindex Lisp objects, modules for the basic displayable
13279 @cindex objects, modules for the basic displayable Lisp
13280
13281 @example
13282 @file{console-msw.c}
13283 @file{console-msw.h}
13284 @file{console-stream.c}
13285 @file{console-stream.h}
13286 @file{console-tty.c}
13287 @file{console-tty.h}
13288 @file{console-x.c}
13289 @file{console-x.h}
13290 @file{console.c}
13291 @file{console.h}
13292 @end example
13293
13294 These modules implement the @dfn{console} Lisp object type. A console
13295 can contain multiple display devices, but only one keyboard and mouse.
13296 Most of the time, a console will contain exactly one device.
13297
13298 Consoles are the top of a Lisp object inclusion hierarchy. Consoles
13299 contain devices, which contain frames, which contain windows.
13300
13301
13302
13303 @example
13304 @file{device-msw.c}
13305 @file{device-tty.c}
13306 @file{device-x.c}
13307 @file{device.c}
13308 @file{device.h}
13309 @end example
13310
13311 These modules implement the @dfn{device} Lisp object type. This
13312 abstracts a particular screen or connection on which frames are
13313 displayed. As with Lisp objects, event interfaces, and other
13314 subsystems, the device code is separated into a generic component that
13315 provides a standardized interface (in the form of a set of methods) and
13316 components that implement it for particular device types.
13317
13318 The device subsystem defines all the methods and provides method
13319 services for not only device operations but also for the frame, window,
13320 menubar, scrollbar, toolbar, and other displayable-object subsystems.
13321 The reason for this is that all of these subsystems have the same
13322 subtypes (X, TTY, NeXTstep, Microsoft Windows, etc.) as devices do.
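
The split between a generic interface and per-type implementations can be pictured as a table of function pointers that each device type fills in. The sketch below is a generic illustration of that technique; the @code{dm_*} names, the two methods, and the fallback convention are all hypothetical and do not reproduce the actual XEmacs method tables.

```c
#include <stddef.h>

struct dm_device;

/* One method table per device type.  A NULL entry means that the
   type does not implement the method. */
struct dm_methods
{
  const char *name;
  int (*init_device) (struct dm_device *d);
  int (*frame_width) (const struct dm_device *d);
};

struct dm_device
{
  const struct dm_methods *methods;
  int columns;
};

/* A type that implements both methods. */
static int
dm_tty_init (struct dm_device *d)
{
  d->columns = 80;
  return 1;
}

static int
dm_tty_width (const struct dm_device *d)
{
  return d->columns;
}

static const struct dm_methods dm_tty_methods =
  { "tty", dm_tty_init, dm_tty_width };

/* A type that leaves frame_width unimplemented. */
static int
dm_stream_init (struct dm_device *d)
{
  d->columns = 0;
  return 1;
}

static const struct dm_methods dm_stream_methods =
  { "stream", dm_stream_init, NULL };

/* Generic code: attach the method table, then dispatch through it. */
static int
dm_make_device (struct dm_device *d, const struct dm_methods *m)
{
  d->methods = m;
  return m->init_device (d);
}

/* Generic code: call the method if present, else use FALLBACK. */
static int
dm_frame_width (const struct dm_device *d, int fallback)
{
  return d->methods->frame_width
    ? d->methods->frame_width (d) : fallback;
}
```

Generic callers never test the device type directly. Adding a new type means supplying another table, which is why it is convenient for frames, windows, scrollbars and the rest to share one set of subtypes.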
13323
13324
13325
13326 @example
13327 @file{frame-msw.c}
13328 @file{frame-tty.c}
13329 @file{frame-x.c}
13330 @file{frame.c}
13331 @file{frame.h}
13332 @end example
13333
13334 Each device contains one or more frames in which objects (e.g. text) are
13335 displayed. A frame corresponds to a window in the window system;
13336 usually this is a top-level window but it could potentially be one of a
13337 number of overlapping child windows within a top-level window, using the
13338 MDI (Multiple Document Interface) protocol in Microsoft Windows or a
13339 similar scheme.
13340
13341 The @file{frame-*} files implement the @dfn{frame} Lisp object type and
13342 provide the generic and device-type-specific operations on frames
13343 (e.g. raising, lowering, resizing, moving, etc.).
13344
13345
13346
13347 @example
13348 @file{window.c}
13349 @file{window.h}
13350 @end example
13351
13352 @cindex window (in Emacs)
13353 @cindex pane
13354 Each frame consists of one or more non-overlapping @dfn{windows} (better
13355 known as @dfn{panes} in standard window-system terminology) in which a
13356 buffer's text can be displayed. Windows can also have scrollbars
13357 displayed around their edges.
13358
13359 @file{window.c} and @file{window.h} implement the @dfn{window} Lisp
13360 object type and provide code to manage windows. Since windows have no
13361 associated resources in the window system (the window system knows only
13362 about the frame; no child windows or anything are used for XEmacs
13363 windows), there is no device-type-specific code here; all of that code
13364 is part of the redisplay mechanism or the code for particular object
13365 types such as scrollbars.
13366
13367 @node The Redisplay Mechanism, Extents, Consoles; Devices; Frames; Windows, Top
13368 @chapter The Redisplay Mechanism
13369 @cindex redisplay mechanism, the
13370
13371 The redisplay mechanism is one of the most complicated sections of
13372 XEmacs, especially from a conceptual standpoint. This is doubly so
13373 because, unlike for the basic aspects of the Lisp interpreter, the
13374 computer science theories of how to efficiently handle redisplay are not
13375 well-developed.
13376
13377 When working with the redisplay mechanism, remember the Golden Rules
13378 of Redisplay:
13379
13380 @enumerate
13381 @item
13382 It Is Better To Be Correct Than Fast.
13383 @item
13384 Thou Shalt Not Run Elisp From Within Redisplay.
13385 @item
13386 It Is Better To Be Fast Than Not To Be.
13387 @end enumerate
13388
13389 @menu
13390 * Critical Redisplay Sections::
13391 * Line Start Cache::
13392 * Redisplay Piece by Piece::
13393 * Modules for the Redisplay Mechanism::
13394 * Modules for other Display-Related Lisp Objects::
13395 @end menu
13396
13397 @node Critical Redisplay Sections, Line Start Cache, The Redisplay Mechanism, The Redisplay Mechanism
13398 @section Critical Redisplay Sections
13399 @cindex redisplay sections, critical
13400 @cindex critical redisplay sections
13401
13402 Within this section, we are defenseless and assume that the
13403 following cannot happen:
13404
13405 @enumerate
13406 @item
13407 garbage collection
13408 @item
13409 Lisp code evaluation
13410 @item
13411 frame size changes
13412 @end enumerate
13413
13414 We ensure (3) by calling @code{hold_frame_size_changes()}, which
13415 will cause any pending frame size changes to get put on hold
13416 till after the end of the critical section. (1) follows
13417 automatically if (2) is met. #### Unfortunately, there are
13418 some places where Lisp code can be called within this section.
13419 We need to remove them.
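
The deferral of frame size changes can be sketched as a hold counter plus a record of the most recent pending request. This is only an illustration of the idea; the @code{fs_*} names and the policy of keeping just the last request are assumptions, not a description of what @code{hold_frame_size_changes()} does internally.

```c
/* Current frame size and the deferred request, if any. */
static int fs_width = 80, fs_height = 24;
static int fs_hold_depth;
static int fs_pending;
static int fs_pending_width, fs_pending_height;

/* Enter a critical section: size changes are deferred from now on. */
static void
fs_hold (void)
{
  fs_hold_depth++;
}

/* Request a size change.  Inside a critical section it is only
   recorded; outside it takes effect immediately. */
static void
fs_change_size (int width, int height)
{
  if (fs_hold_depth > 0)
    {
      fs_pending = 1;
      fs_pending_width = width;
      fs_pending_height = height;
      return;
    }
  fs_width = width;
  fs_height = height;
}

/* Leave a critical section.  When the outermost one ends, apply the
   deferred change. */
static void
fs_unhold (void)
{
  if (fs_hold_depth > 0 && --fs_hold_depth == 0 && fs_pending)
    {
      fs_pending = 0;
      fs_width = fs_pending_width;
      fs_height = fs_pending_height;
    }
}
```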
13420
13421 If @code{Fsignal()} is called during this critical section, we
13422 will @code{abort()}.
13423
13424 If garbage collection is called during this critical section,
13425 we simply return. #### We should abort instead.
13426
13427 #### If a frame-size change does occur we should probably
13428 actually be preempting redisplay.
13429
13430 @node Line Start Cache, Redisplay Piece by Piece, Critical Redisplay Sections, The Redisplay Mechanism
13431 @section Line Start Cache
13432 @cindex line start cache
13433
13434 The traditional scrolling code in Emacs breaks in a variable-height
13435 world. It depends on the key assumption that the number of lines that
13436 can be displayed at any given time is fixed. This led to a complete
13437 separation of the scrolling code from the redisplay code. In order to
13438 fully support variable-height lines, the scrolling code must actually be
13439 tightly integrated with redisplay. Only redisplay can determine how
13440 many lines will be displayed on a screen for any given starting point.
13441
13442 What is ideally wanted is a complete list of the starting buffer
13443 position for every possible display line of a buffer along with the
13444 height of that display line. Maintaining such a full list would be very
13445 expensive. We settle for having it include information for all areas
13446 which we happen to generate anyhow (i.e. the region currently being
13447 displayed) and for those areas we need to work with.
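
A minimal sketch of such a cache follows: each entry records where a display line starts and ends in the buffer and how tall it is, and scrolling code can then ask how many lines fit in a window of a given height from a given starting line. The @code{lsc_*} names and the exact fields are hypothetical.

```c
#include <stddef.h>

/* One cached display line. */
struct lsc_entry
{
  long start;                   /* first buffer position on the line */
  long end;                     /* last buffer position on the line */
  int height;                   /* height of the line in pixels */
};

/* Return the index of the cached line containing POS, or -1 if the
   cache has no information about that position. */
static long
lsc_find (const struct lsc_entry *cache, size_t n, long pos)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (pos >= cache[i].start && pos <= cache[i].end)
      return (long) i;
  return -1;
}

/* Count how many whole display lines, starting with entry FIRST, fit
   into WINDOW_HEIGHT pixels.  With variable-height lines this depends
   on the starting line and cannot be a fixed number. */
static size_t
lsc_lines_that_fit (const struct lsc_entry *cache, size_t n,
                    size_t first, int window_height)
{
  size_t count = 0;
  int used = 0;

  while (first + count < n
         && used + cache[first + count].height <= window_height)
    {
      used += cache[first + count].height;
      count++;
    }
  return count;
}
```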
13448
13449 In order to ensure that the cache accurately represents what redisplay
13450 would actually show, it is necessary to invalidate it in many
13451 situations. If the buffer changes, the starting positions may no longer
13452 be correct. If a face or an extent has changed then the line heights
13453 may have altered. These events happen frequently enough that the cache
13454 can end up being constantly disabled. With this potentially constant
13455 invalidation, when is the cache ever useful?
13456
13457 Even if the cache is invalidated before every single usage, it is
13458 necessary. Scrolling often requires knowledge about display lines which
13459 are actually above or below the visible region. The cache provides a
13460 convenient light-weight method of storing this information for multiple
13461 display regions. This knowledge is necessary for the scrolling code to
13462 always obey the First Golden Rule of Redisplay.
13463
13464 If the cache already contains all of the information that the scrolling
13465 routines happen to need so that it doesn't have to go generate it, then
13466 we are able to obey the Third Golden Rule of Redisplay. The first thing
13467 we do to help out the cache is to always add the displayed region. This
13468 region had to be generated anyway, so the cache ends up getting the
13469 information basically for free. In those cases where a user is simply
13470 scrolling around viewing a buffer there is a high probability that this
13471 is sufficient to always provide the needed information. The second
13472 thing we can do is be smart about invalidating the cache.
13473
13474 TODO---Be smart about invalidating the cache. Potential places:
13475
13476 @itemize @bullet
13477 @item
13478 Insertions at end-of-line which don't cause line-wraps do not alter the
13479 starting positions of any display lines. These types of buffer
13480 modifications should not invalidate the cache. This is actually a large
13481 optimization for redisplay speed as well.
13482 @item
13483 Buffer modifications frequently only affect the display of lines at and
13484 below where they occur. In these situations we should only invalidate
13485 the part of the cache starting at where the modification occurs.
13486 @end itemize
13487
13488 In case you're wondering, the Second Golden Rule of Redisplay is not
13489 applicable.
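
The partial invalidation proposed in the second TODO item above could look
roughly like the following. This is only a sketch: it assumes the cache is
an array of entries sorted by starting position, and the names
@code{line_start_entry}, @code{line_start_cache_sketch} and
@code{invalidate_from} are hypothetical, not taken from
@file{redisplay.c}.

```c
#include <stddef.h>

/* Hypothetical cache entry describing one display line. */
struct line_start_entry
{
  long start;   /* buffer position where the display line starts */
  long end;     /* buffer position where the display line ends */
  int height;   /* pixel height of the display line */
};

/* Hypothetical cache: entries are contiguous lines, sorted by start. */
struct line_start_cache_sketch
{
  struct line_start_entry *entries;
  size_t count;
};

/* Drop every entry whose line ends at or after MODPOS.  Lines that lie
   wholly before the modification remain valid.  Returns the number of
   entries kept. */
size_t
invalidate_from (struct line_start_cache_sketch *cache, long modpos)
{
  size_t lo = 0, hi = cache->count;

  /* Binary search for the first entry with end >= modpos. */
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (cache->entries[mid].end < modpos)
        lo = mid + 1;
      else
        hi = mid;
    }
  cache->count = lo;
  return lo;
}
```

Keeping the entries before the modification point means that scrolling
backward after an edit can still be served from the cache.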
13490
13491 @node Redisplay Piece by Piece, Modules for the Redisplay Mechanism, Line Start Cache, The Redisplay Mechanism
13492 @section Redisplay Piece by Piece
13493 @cindex redisplay piece by piece
13494
13495 As you can begin to see, redisplay is complex and also not well
13496 documented. Chuck no longer works on XEmacs, so this section is my take
13497 on the workings of redisplay.
13498
13499 Redisplay happens in three phases:
13500
13501 @enumerate
13502 @item
13503 Determine the desired display in the area that needs redisplay.
13504 Implemented by @file{redisplay.c}.
13505 @item
13506 Compare the desired display with the current display.
13507 Implemented by @file{redisplay-output.c}.
13508 @item
13509 Output the changes. Implemented by @file{redisplay-output.c},
13510 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c}.
13511 @end enumerate
13512
13513 Steps 1 and 2 are device-independent and relatively complex. Step 3 is
13514 mostly device-dependent.
13515
13516 Determining the desired display
13517
13518 Display attributes are stored in @code{display_line} structures. Each
13519 @code{display_line} consists of a set of @code{display_block}'s and each
13520 @code{display_block} contains a number of @code{rune}'s. Generally
13521 dynarr's of @code{display_line}'s are held by each window representing
13522 the current display and the desired display.
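
The nesting of lines, blocks and runes, and the comparison of the desired
display against the current display, can be pictured with the sketch
below. The structures are hypothetical and heavily simplified (note the
@code{_sketch} suffixes); the real definitions in @file{redisplay.h}
carry many more fields, and @code{display_line_changed} is an invented
helper, not a function from the XEmacs sources.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the redisplay structures. */
struct rune_sketch
{
  int xpos;    /* pixel position of the rune on the line */
  int width;   /* pixel width of the rune */
  int ch;      /* character shown (glyph runes are ignored here) */
};

struct display_block_sketch
{
  struct rune_sketch *runes;
  size_t num_runes;
};

struct display_line_sketch
{
  int ypos;
  int height;
  struct display_block_sketch *blocks;
  size_t num_blocks;
};

/* Return 1 if DESIRED differs from CURRENT and so must be output
   again, 0 if the two lines are identical. */
int
display_line_changed (const struct display_line_sketch *current,
                      const struct display_line_sketch *desired)
{
  size_t i, j;

  if (current->ypos != desired->ypos
      || current->height != desired->height
      || current->num_blocks != desired->num_blocks)
    return 1;

  for (i = 0; i < current->num_blocks; i++)
    {
      const struct display_block_sketch *cb = &current->blocks[i];
      const struct display_block_sketch *db = &desired->blocks[i];

      if (cb->num_runes != db->num_runes)
        return 1;
      for (j = 0; j < cb->num_runes; j++)
        if (cb->runes[j].xpos != db->runes[j].xpos
            || cb->runes[j].width != db->runes[j].width
            || cb->runes[j].ch != db->runes[j].ch)
          return 1;
    }
  return 0;
}
```

Only lines for which such a comparison reports a difference would need to
be handed to the device-specific output methods.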
13523
13524 The @code{display_line} structures are tightly tied to buffers, which
13525 presents a problem for redisplay, as this connection is bogus for the
13526 modeline. Hence the @code{display_line} generation routines are
13527 duplicated for generating the modeline. This means that the modeline
13528 display code has many bugs that the standard redisplay code does not.
13529
13530 The guts of @code{display_line} generation are in
13531 @code{create_text_block}, which creates a single display line for the
13532 desired locale. This incrementally parses the characters on the current
13533 line and generates redisplay structures for each.
13534
13535 Gutter redisplay is different. Because the data to display is stored in
13536 a string we cannot use @code{create_text_block}. Instead we use
13537 @code{create_text_string_block} which performs the same function as
13538 @code{create_text_block} but for strings. Many of the complexities of
13539 @code{create_text_block} to do with cursor handling and selective
13540 display have been removed.
13541
13542 @node Modules for the Redisplay Mechanism, Modules for other Display-Related Lisp Objects, Redisplay Piece by Piece, The Redisplay Mechanism
13543 @section Modules for the Redisplay Mechanism
13544 @cindex modules for the redisplay mechanism
13545 @cindex redisplay mechanism, modules for the
13546
13547 @example
13548 @file{redisplay-output.c}
13549 @file{redisplay-msw.c}
13550 @file{redisplay-tty.c}
13551 @file{redisplay-x.c}
13552 @file{redisplay.c}
13553 @file{redisplay.h}
13554 @end example
13555
13556 These files provide the redisplay mechanism. As with many other
13557 subsystems in XEmacs, there is a clean separation between the general
13558 and device-specific support.
13559
13560 @file{redisplay.c} contains the bulk of the redisplay engine. These
13561 functions update the redisplay structures (which describe how the screen
13562 is to appear) to reflect any changes made to the state of any
13563 displayable objects (buffer, frame, window, etc.) since the last time
13564 that redisplay was called. These functions are highly optimized to
13565 avoid doing more work than necessary (since redisplay is called
13566 extremely often and is potentially a huge time sink), and depend heavily
13567 on notifications from the objects themselves that changes have occurred,
13568 so that redisplay doesn't explicitly have to check each possible object.
13569 The redisplay mechanism also contains a great deal of caching to further
13570 speed things up; some of this caching is contained within the various
13571 displayable objects.
13572
13573 @file{redisplay-output.c} goes through the redisplay structures and converts
13574 them into calls to device-specific methods to actually output the screen
13575 changes.
13576
13577 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c}
13578 are implementations of these redisplay output methods, for X frames,
13579 MS Windows frames and TTY frames, respectively.
13580
13581
13582
13583 @example
13584 @file{indent.c}
13585 @end example
13586
13587 This module contains various functions and Lisp primitives for
13588 converting between buffer positions and screen positions. These
13589 functions call the redisplay mechanism to do most of the work, and then
13590 examine the redisplay structures to get the necessary information. This
13591 module needs work.
13592
13593
13594
13595 @example
13596 @file{termcap.c}
13597 @file{terminfo.c}
13598 @file{tparam.c}
13599 @end example
13600
13601 These files contain functions for working with the termcap (BSD-style)
13602 and terminfo (System V style) databases of terminal capabilities and
13603 escape sequences, used when XEmacs is displaying in a TTY.
13604
13605
13606
13607 @example
13608 @file{cm.c}
13609 @file{cm.h}
13610 @end example
13611
13612 These files provide some miscellaneous TTY-output functions and should
13613 probably be merged into @file{redisplay-tty.c}.
13614
13615
13616
13617 @node Modules for other Display-Related Lisp Objects, , Modules for the Redisplay Mechanism, The Redisplay Mechanism
13618 @section Modules for other Display-Related Lisp Objects
13619 @cindex modules for other display-related Lisp objects
13620 @cindex display-related Lisp objects, modules for other
13621 @cindex Lisp objects, modules for other display-related
13622
13623 @example
13624 @file{faces.c}
13625 @file{faces.h}
13626 @end example
13627
13628
13629
13630 @example
13631 @file{bitmaps.h}
13632 @file{glyphs-eimage.c}
13633 @file{glyphs-msw.c}
13634 @file{glyphs-msw.h}
13635 @file{glyphs-widget.c}
13636 @file{glyphs-x.c}
13637 @file{glyphs-x.h}
13638 @file{glyphs.c}
13639 @file{glyphs.h}
13640 @end example
13641
13642
13643
13644 @example
13645 @file{objects-msw.c}
13646 @file{objects-msw.h}
13647 @file{objects-tty.c}
13648 @file{objects-tty.h}
13649 @file{objects-x.c}
13650 @file{objects-x.h}
13651 @file{objects.c}
13652 @file{objects.h}
13653 @end example
13654
13655
13656
13657 @example
13658 @file{menubar-msw.c}
13659 @file{menubar-msw.h}
13660 @file{menubar-x.c}
13661 @file{menubar.c}
13662 @file{menubar.h}
13663 @end example
13664
13665
13666
13667 @example
13668 @file{scrollbar-msw.c}
13669 @file{scrollbar-msw.h}
13670 @file{scrollbar-x.c}
13671 @file{scrollbar-x.h}
13672 @file{scrollbar.c}
13673 @file{scrollbar.h}
13674 @end example
13675
13676
13677
13678 @example
13679 @file{toolbar-msw.c}
13680 @file{toolbar-x.c}
13681 @file{toolbar.c}
13682 @file{toolbar.h}
13683 @end example
13684
13685
13686
13687 @example
13688 @file{font-lock.c}
13689 @end example
13690
13691 This file provides C support for syntax highlighting---i.e.
13692 highlighting different syntactic constructs of a source file in
13693 different colors, for easy reading. The C support is provided so that
13694 this is fast.
13695
13696
13697
13698 @example
13699 @file{dgif_lib.c}
13700 @file{gif_err.c}
13701 @file{gif_lib.h}
13702 @file{gifalloc.c}
13703 @end example
13704
13705 These modules decode GIF-format image files, for use with glyphs.
13706 These files were removed due to Unisys patent infringement concerns.
13707
13708
13709 @node Extents, Faces, The Redisplay Mechanism, Top
13710 @chapter Extents
13711 @cindex extents
13712
13713 @menu
13714 * Introduction to Extents:: Extents are ranges over text, with properties.
13715 * Extent Ordering:: How extents are ordered internally.
13716 * Format of the Extent Info:: The extent information in a buffer or string.
13717 * Zero-Length Extents:: A weird special case.
13718 * Mathematics of Extent Ordering:: A rigorous foundation.
13719 * Extent Fragments:: Cached information useful for redisplay.
13720 @end menu
13721
13722 @node Introduction to Extents, Extent Ordering, Extents, Extents
13723 @section Introduction to Extents
13724 @cindex extents, introduction to
13725
13726 Extents are regions over a buffer, with a start and an end position
13727 denoting the region of the buffer included in the extent. In
13728 addition, either end can be closed or open, meaning that the endpoint
13729 is or is not logically included in the extent. Insertion of a character
13730 at a closed endpoint causes the character to go inside the extent;
13731 insertion at an open endpoint causes the character to go outside.
13732
13733 Extent endpoints are stored using memory indices (see @file{insdel.c}),
13734 to minimize the amount of adjusting that needs to be done when
13735 characters are inserted or deleted.
13736
13737 (Formerly, extent endpoints at the gap could be either before or
13738 after the gap, depending on the open/closedness of the endpoint.
13739 The intent of this was to make it so that insertions would
13740 automatically go inside or out of extents as necessary with no
13741 further work needing to be done. It didn't work out that way,
13742 however, and just ended up complexifying and buggifying all the
13743 rest of the code.)
13744
13745 @node Extent Ordering, Format of the Extent Info, Introduction to Extents, Extents
13746 @section Extent Ordering
13747 @cindex extent ordering
13748
13749 Extents are compared using memory indices. There are two orderings
13750 for extents and both orders are kept current at all times. The normal
13751 or @dfn{display} order is as follows:
13752
13753 @example
13754 Extent A is ``less than'' extent B,
13755 that is, earlier in the display order,
13756 if: A-start < B-start,
13757 or if: A-start = B-start, and A-end > B-end
13758 @end example
13759
13760 So if two extents begin at the same position, the larger of them is the
13761 earlier one in the display order (@code{EXTENT_LESS} is true).
13762
13763 For the e-order, the same thing holds:
13764
13765 @example
13766 Extent A is ``less than'' extent B in e-order,
13767 that is, earlier in the e-order,
13768 if: A-end < B-end,
13769 or if: A-end = B-end, and A-start > B-start
13770 @end example
13771
13772 So if two extents end at the same position, the smaller of them is the
13773 earlier one in the e-order (@code{EXTENT_E_LESS} is true).
13774
13775 The display order and the e-order are complementary orders: any
13776 theorem about the display order also applies to the e-order if you swap
13777 all occurrences of ``display order'' and ``e-order'', ``less than'' and
13778 ``greater than'', and ``extent start'' and ``extent end''.
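
The two orderings can be written down as a pair of comparison functions.
The following is a sketch under simplifying assumptions: an extent is
reduced to its two memory indices, and @code{extent_sketch},
@code{extent_less_sketch} and @code{extent_e_less_sketch} are
hypothetical names that merely mirror the rules stated above (they are
not the @code{EXTENT_LESS} and @code{EXTENT_E_LESS} macros themselves).

```c
/* Hypothetical extent reduced to its start and end memory indices. */
struct extent_sketch
{
  long start;
  long end;
};

/* Display order: earlier start first; for equal starts, the larger
   extent (later end) comes first. */
int
extent_less_sketch (const struct extent_sketch *a,
                    const struct extent_sketch *b)
{
  return a->start < b->start
    || (a->start == b->start && a->end > b->end);
}

/* E-order: earlier end first; for equal ends, the smaller extent
   (later start) comes first. */
int
extent_e_less_sketch (const struct extent_sketch *a,
                      const struct extent_sketch *b)
{
  return a->end < b->end
    || (a->end == b->end && a->start > b->start);
}
```

Swapping @code{start} with @code{end} turns one function into the other,
which is the complementarity described above.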
13779
13780 @node Format of the Extent Info, Zero-Length Extents, Extent Ordering, Extents
13781 @section Format of the Extent Info
13782 @cindex extent info, format of the
13783
13784 An extent-info structure consists of a list of the buffer or string's
13785 extents and a @dfn{stack of extents} that lists all of the extents over
13786 a particular position. The stack-of-extents info is used for
13787 optimization purposes---it basically caches some info that might
13788 be expensive to compute. Certain otherwise hard computations are easy
13789 given the stack of extents over a particular position, and if the
13790 stack of extents over a nearby position is known (because it was
13791 calculated at some prior point in time), it's easy to move the stack
13792 of extents to the proper position.
13793
13794 Given that the stack of extents is an optimization, and given that
13795 it requires memory, a string's stack of extents is wiped out each
13796 time a garbage collection occurs. Therefore, any time you retrieve
13797 the stack of extents, it might not be there. If you need it to
13798 be there, use the @code{_force} version.
13799
13800 Similarly, a string may or may not have an @code{extent_info} structure.
13801 (Generally it won't if there haven't been any extents added to the
13802 string.) So use the @code{_force} version if you need the @code{extent_info}
13803 structure to be there.
13804
13805 A list of extents is maintained as a double gap array. One gap array
13806 is ordered by start index (the @dfn{display order}) and the other is
13807 ordered by end index (the @dfn{e-order}). Note that positions in an
13808 extent list should logically be conceived of as referring @emph{to} a
13809 particular extent (as is the norm in programs) rather than sitting
13810 between two extents. Note also that callers of these functions should
13811 not be aware of the fact that the extent list is implemented as an
13812 array, except for the fact that positions are integers (this should be
13813 generalized to handle integers and linked lists equally well).
13814
13815 A gap array is the same structure used by buffer text: an array of
13816 elements with a ``gap'' somewhere in the middle. Insertion and deletion
13817 happen by moving the gap to the insertion/deletion point, and then
13818 expanding/contracting as necessary. Gap arrays have a number of
13819 useful properties:
13820
13821 @enumerate
13822 @item
13823 They are space efficient, as there is no need for next/previous pointers.
13824
13825 @item
13826 If the items in them are sorted, locating an item is fast -- @math{O(log N)}.
13827
13828 @item
13829 Insertion and deletion are very fast (constant time, essentially) if the
13830 gap is near (which favors localized operations, as will usually be the
13831 case). Even if not, it requires only a block move of memory, which is
13832 generally a highly optimized operation on modern processors.
13833
13834 @item
13835 Code to manipulate them is relatively simple to write.
13836 @end enumerate
13837
13838 An alternative would be balanced binary trees, which have guaranteed
13839 @math{O(log N)} time for all operations (although the constant factors
13840 are not as good, and repeated localized operations will be slower than
13841 for a gap array). Such code is quite tricky to write, however.
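
To make the gap-moving idea concrete, here is a minimal sketch of a gap
array of integers. It assumes a fixed capacity and never grows, and all
of the names (@code{gap_array_sketch}, @code{gap_move}, and so on) are
hypothetical; it is not the gap array code used for extents or buffer
text.

```c
#include <assert.h>
#include <string.h>

#define GAP_CAPACITY 16

/* Hypothetical fixed-capacity gap array of ints. */
struct gap_array_sketch
{
  int data[GAP_CAPACITY];
  size_t gap_start;   /* index of the first unused slot */
  size_t gap_end;     /* index just past the last unused slot */
};

void
gap_init (struct gap_array_sketch *g)
{
  g->gap_start = 0;
  g->gap_end = GAP_CAPACITY;
}

size_t
gap_length (const struct gap_array_sketch *g)
{
  return GAP_CAPACITY - (g->gap_end - g->gap_start);
}

/* Move the gap so that it begins at logical position POS.  This is
   the block move of memory mentioned in the text. */
void
gap_move (struct gap_array_sketch *g, size_t pos)
{
  assert (pos <= gap_length (g));
  if (pos < g->gap_start)
    {
      size_t n = g->gap_start - pos;
      memmove (&g->data[g->gap_end - n], &g->data[pos], n * sizeof (int));
      g->gap_start -= n;
      g->gap_end -= n;
    }
  else if (pos > g->gap_start)
    {
      size_t n = pos - g->gap_start;
      memmove (&g->data[g->gap_start], &g->data[g->gap_end],
               n * sizeof (int));
      g->gap_start += n;
      g->gap_end += n;
    }
}

/* Insert VALUE before logical position POS (no growth in this sketch). */
void
gap_insert (struct gap_array_sketch *g, size_t pos, int value)
{
  assert (g->gap_start < g->gap_end);
  gap_move (g, pos);
  g->data[g->gap_start++] = value;
}

/* Delete the element at logical position POS by widening the gap. */
void
gap_delete (struct gap_array_sketch *g, size_t pos)
{
  assert (pos < gap_length (g));
  gap_move (g, pos);
  g->gap_end++;
}

/* Fetch the element at logical position POS, skipping over the gap. */
int
gap_get (const struct gap_array_sketch *g, size_t pos)
{
  assert (pos < gap_length (g));
  return pos < g->gap_start
    ? g->data[pos]
    : g->data[pos + (g->gap_end - g->gap_start)];
}
```

When successive operations hit the same position, @code{gap_move} does no
copying at all, which is why localized editing is essentially constant
time.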
13842
13843 @node Zero-Length Extents, Mathematics of Extent Ordering, Format of the Extent Info, Extents
13844 @section Zero-Length Extents
13845 @cindex zero-length extents
13846 @cindex extents, zero-length
13847
13848 Extents can be zero-length, and will end up that way if their endpoints
13849 are explicitly set that way or if their detachable property is @code{nil}
13850 and all the text in the extent is deleted. (The exception is open-open
13851 zero-length extents, which are barred from existing because there is
13852 no sensible way to define their properties. Deletion of the text in
13853 an open-open extent causes it to be converted into a closed-open
13854 extent.) Zero-length extents are primarily used to represent
13855 annotations, and behave as follows:
13856
13857 @enumerate
13858 @item
13859 Insertion at the position of a zero-length extent expands the extent
13860 if both endpoints are closed; goes after the extent if it is closed-open;
13861 and goes before the extent if it is open-closed.
13862
13863 @item
13864 Deletion of a character on a side of a zero-length extent whose
13865 corresponding endpoint is closed causes the extent to be detached if
13866 it is detachable; if the extent is not detachable or the corresponding
13867 endpoint is open, the extent remains in the buffer, moving as necessary.
13868 @end enumerate
13869
13870 Note that closed-open, non-detachable zero-length extents behave
13871 exactly like markers and that open-closed, non-detachable zero-length
13872 extents behave like the ``point-type'' marker in Mule.
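
The insertion rule in item 1 above amounts to a three-way decision on the
open/closed state of the two endpoints. The sketch below expresses it as
a function; the enumeration and the function name are hypothetical and
exist only for illustration.

```c
#include <assert.h>

/* Where text inserted at the position of a zero-length extent ends up. */
enum insertion_side_sketch
{
  TEXT_GOES_INSIDE,   /* the extent expands to cover the text */
  TEXT_GOES_AFTER,    /* the text lands after the extent */
  TEXT_GOES_BEFORE    /* the text lands before the extent */
};

/* START_OPEN and END_OPEN are nonzero when the respective endpoint is
   open.  Open-open zero-length extents are barred from existing, so
   that combination is rejected. */
enum insertion_side_sketch
zero_length_insertion_side (int start_open, int end_open)
{
  assert (!(start_open && end_open));
  if (!start_open && !end_open)
    return TEXT_GOES_INSIDE;    /* closed-closed */
  if (!start_open && end_open)
    return TEXT_GOES_AFTER;     /* closed-open */
  return TEXT_GOES_BEFORE;      /* open-closed */
}
```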
13873
13874 @node Mathematics of Extent Ordering, Extent Fragments, Zero-Length Extents, Extents
13875 @section Mathematics of Extent Ordering
13876 @cindex mathematics of extent ordering
13877 @cindex extent mathematics
13878 @cindex extent ordering
13879
13880 @cindex display order of extents
13881 @cindex extents, display order
13882 The extents in a buffer are ordered by ``display order'' because that
13883 is the order in which the redisplay mechanism needs to process them.
13884 The e-order is an auxiliary ordering used to facilitate operations
13885 over extents. The operations that can be performed on the ordered
13886 list of extents in a buffer are
13887
13888 @enumerate
13889 @item
13890 Locate where an extent would go if inserted into the list.
13891 @item
13892 Insert an extent into the list.
13893 @item
13894 Remove an extent from the list.
13895 @item
13896 Map over all the extents that overlap a range.
13897 @end enumerate
13898
13899 (4) requires being able to determine the first and last extents
13900 that overlap a range.
13901
13902 NOTE: @dfn{overlap} is used as follows:
13903
13904 @itemize @bullet
13905 @item
13906 two ranges overlap if they have at least one point in common.
13907 Whether the endpoints are open or closed makes a difference here.
13908 @item
13909 a point overlaps a range if the point is contained within the
13910 range; this is equivalent to treating a point @math{P} as the range
13911 @math{[P, P]}.
13912 @item
13913 In the case of an @emph{extent} overlapping a point or range, the extent
13914 is normally treated as having closed endpoints. This applies
13915 consistently in the discussion of stacks of extents and such below.
13916 Note that this definition of overlap is not necessarily consistent with
13917 the extents that @code{map-extents} maps over, since @code{map-extents}
13918 sometimes pays attention to whether the endpoints of an extent are open
13919 or closed. But for our purposes, it greatly simplifies things to treat
13920 all extents as having closed endpoints.
13921 @end itemize
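
Under the closed-endpoint convention adopted here, the overlap tests
reduce to two comparisons. The following sketch uses hypothetical names
(@code{range_sketch}, @code{ranges_overlap_closed},
@code{point_overlaps_range}) and is meant only to pin down the
definition.

```c
/* Hypothetical closed range [start, end]. */
struct range_sketch
{
  long start;
  long end;
};

/* Two closed ranges overlap if they share at least one point. */
int
ranges_overlap_closed (const struct range_sketch *a,
                       const struct range_sketch *b)
{
  return a->start <= b->end && b->start <= a->end;
}

/* A point P overlaps a range if the range [P, P] overlaps it. */
int
point_overlaps_range (long p, const struct range_sketch *r)
{
  struct range_sketch point = { p, p };
  return ranges_overlap_closed (&point, r);
}
```

Note that two ranges which merely touch at an endpoint count as
overlapping under this definition.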
13922
13923 First, define @math{>}, @math{<}, @math{<=}, etc. as applied to extents
13924 to mean comparison according to the display order. Comparison between
13925 an extent @math{E} and an index @math{I} means comparison between
13926 @math{E} and the range @math{[I, I]}.
13927
13928 Also define @math{e>}, @math{e<}, @math{e<=}, etc. to mean comparison
13929 according to the e-order.
13930
13931 For any range @math{R}, define @math{R(0)} to be the starting index of
13932 the range and @math{R(1)} to be the ending index of the range.
13933
13934 For any extent @math{E}, define @math{E(next)} to be the extent directly
13935 following @math{E}, and @math{E(prev)} to be the extent directly
13936 preceding @math{E}. Assume @math{E(next)} and @math{E(prev)} can be
13937 determined from @math{E} in constant time. (This holds because we store
13938 the extent list as a gap array, in which neighboring elements are adjacent.)
13939
13940 Similarly, define @math{E(e-next)} and @math{E(e-prev)} to be the
13941 extents directly following and preceding @math{E} in the e-order.
13942
13943 Now:
13944
13945 Let @math{R} be a range.
13946 Let @math{F} be the first extent overlapping @math{R}.
13947 Let @math{L} be the last extent overlapping @math{R}.
13948
13949 Theorem 1: @math{R(1)} lies between @math{L} and @math{L(next)},
13950 i.e. @math{L <= R(1) < L(next)}.
13951
13952 This follows easily from the definition of display order. The
13953 basic reason that this theorem applies is that the display order
13954 sorts by increasing starting index.
13955
13956 Therefore, we can determine @math{L} just by looking at where we would
13957 insert @math{R(1)} into the list, and if we know @math{F} and are moving
13958 forward over extents, we can easily determine when we've hit @math{L} by
13959 comparing the extent we're at to @math{R(1)}.
13960
13961 @example
13962 Theorem 2: @math{F(e-prev) e< [1, R(0)] e<= F}.
13963 @end example
13964
13965 This is the analog of Theorem 1, and applies because the e-order
13966 sorts by increasing ending index.
13967
13968 Therefore, @math{F} can be found in the same amount of time as
13969 operation (1), i.e. the time that it takes to locate where an extent
13970 would go if inserted into the e-order list. This is @math{O(log N)},
13971 since we are using gap arrays to manage extents.
13972
13973 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents
13974 (ordered in display order and e-order, just like for normal extent
13975 lists) that overlap an index @math{I}.
13976
13977 Now:
13978
13979 Let @math{I} be an index, let @math{S} be the stack of extents on
13980 @math{I} and let @math{F} be the first extent in @math{S}.
13981
13982 Theorem 3: The first extent in @math{S} is the first extent that overlaps
13983 any range @math{[I, J]}.
13984
13985 Proof: Any extent that overlaps @math{[I, J]} but does not include
13986 @math{I} must have a start index @math{> I}, and thus be greater than
13987 any extent in @math{S}.
13988
13989 Therefore, finding the first extent that overlaps a range @math{R} is
13990 the same as finding the first extent that overlaps @math{R(0)}.
13991
13992 Theorem 4: Let @math{I2} be an index such that @math{I2 > I}, and let
13993 @math{F2} be the first extent that overlaps @math{I2}. Then, either
13994 @math{F2} is in @math{S} or @math{F2} is greater than any extent in
13995 @math{S}.
13996
13997 Proof: If @math{F2} does not include @math{I} then its start index is
13998 greater than @math{I} and thus it is greater than any extent in
13999 @math{S}, including @math{F}. Otherwise, @math{F2} includes @math{I}
14000 and thus is in @math{S}, and thus @math{F2 >= F}.
14001
14002 @node Extent Fragments, , Mathematics of Extent Ordering, Extents
14003 @section Extent Fragments
14004 @cindex extent fragments
14005 @cindex fragments, extent
14006
14007 Imagine that the buffer is divided up into contiguous, non-overlapping
14008 @dfn{runs} of text such that no extent starts or ends within a run
14009 (extents that abut the run don't count).
14010
14011 An extent fragment is a structure that holds data about the run that
14012 contains a particular buffer position (if the buffer position is at the
14013 junction of two runs, the run after the position is used)---the
14014 beginning and end of the run, a list of all of the extents in that run,
14015 the @dfn{merged face} that results from merging all of the faces
14016 corresponding to those extents, the begin and end glyphs at the
14017 beginning of the run, etc. This is the information that redisplay needs
14018 in order to display this run.
14019
14020 Extent fragments have to be very quick to update to a new buffer
14021 position when moving linearly through the buffer. They rely on the
14022 stack-of-extents code, which does the heavy-duty algorithmic work of
14023 determining which extents overlie a particular position.
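
The boundaries of the run containing a position follow directly from the
definition of a run. The sketch below finds them with a naive linear
scan over all extents; it is an illustration only, with hypothetical
names (@code{frag_extent_sketch}, @code{run_sketch},
@code{run_containing}), and makes no attempt at the speed that the
stack-of-extents code provides.

```c
#include <stddef.h>

/* Hypothetical extent reduced to its two endpoints. */
struct frag_extent_sketch
{
  long start;
  long end;
};

/* Hypothetical run: no extent starts or ends strictly inside it. */
struct run_sketch
{
  long begin;
  long end;
};

/* Find the run containing POS in a buffer spanning BUF_BEGIN..BUF_END.
   If POS sits at the junction of two runs, the run after POS is
   returned, since an endpoint equal to POS becomes the run's beginning. */
struct run_sketch
run_containing (const struct frag_extent_sketch *extents, size_t n,
                long buf_begin, long buf_end, long pos)
{
  struct run_sketch run = { buf_begin, buf_end };
  size_t i;
  int k;

  for (i = 0; i < n; i++)
    {
      long endpoints[2] = { extents[i].start, extents[i].end };
      for (k = 0; k < 2; k++)
        {
          long p = endpoints[k];
          if (p <= pos && p > run.begin)
            run.begin = p;   /* closest endpoint at or before POS */
          if (p > pos && p < run.end)
            run.end = p;     /* closest endpoint after POS */
        }
    }
  return run;
}
```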
14024
14025 @node Faces, Glyphs, Extents, Top
14026 @chapter Faces
14027 @cindex faces
14028
14029 Not yet documented.
14030
14031 @node Glyphs, Specifiers, Faces, Top
14032 @chapter Glyphs
14033 @cindex glyphs
14034
14035 Glyphs are graphical elements that can be displayed in XEmacs buffers or
14036 gutters. We use the term graphical element here in the broadest possible
14037 sense since glyphs can be as mundane as text or as arcane as a native
14038 tab widget.
14039
14040 In XEmacs, glyphs represent the uninstantiated state of graphical
14041 elements, i.e. they hold all the information necessary to produce an
14042 image on-screen but the image need not exist at this stage, and multiple
14043 screen images can be instantiated from a single glyph.
14044
14045 @c #### find a place for this discussion
14046 @c The decision to make image specifiers a separate type is debatable.
14047 @c In fact, the design decision to create a separate image specifier
14048 @c type, rather than make glyphs themselves be specifiers, is
14049 @c debatable---the other properties of glyphs are rarely used and could
14050 @c conceivably have been incorporated into the glyph's instantiator.
14051 @c The rarely used glyph types (buffer, pointer, icon) could also have
14052 @c been incorporated into the instantiator.
14053
14054 Glyphs are lazily instantiated by calling one of the glyph
14055 functions. This usually occurs within redisplay when
14056 @code{Fglyph_height} is called. Instantiation causes an image-instance
14057 to be created and cached. This cache is on a per-device basis for all glyphs
14058 except widget-glyphs, and on a per-window basis for widget-glyphs. The
14059 caching is done by @code{image_instantiate} and is necessary because it
14060 is generally possible to display an image-instance in multiple
14061 domains. For instance, if we create a Pixmap, we can actually display
14062 this on multiple windows, even though we only need a single Pixmap
14063 instance to do this. If caching weren't done then it would be necessary
14064 to create image-instances for every displayable occurrence of a glyph,
14065 and every usage, and this would be extremely memory- and CPU-intensive.
14066
14067 Widget-glyphs (a.k.a native widgets) are not cached in this way. This is
14068 because widget-glyph image-instances on screen are toolkit windows, and
14069 thus cannot be reused in multiple XEmacs domains. Thus widget-glyphs are
14070 cached on an XEmacs window basis.
14071
14072 Any action on a glyph first consults the cache before actually
14073 instantiating a widget.
14074
14075 @section Glyph Instantiation
14076 @cindex glyph instantiation
14077 @cindex instantiation, glyph
14078
14079 Glyph instantiation is a hairy topic and requires some explanation. The
14080 guts of glyph instantiation are contained within
14081 @code{image_instantiate}. A glyph contains an image which is a
14082 specifier. When a glyph function - for instance @code{Fglyph_height} -
14083 asks for a property of the glyph that can only be determined from its
14084 instantiated state, then the glyph image is instantiated and an image
14085 instance created. The instantiation process is governed by the specifier
14086 code and goes through a series of steps:
14087
14088 @itemize @bullet
14089 @item
14090 Validation. Instantiation of image instances happens dynamically - often
14091 within the guts of redisplay. Thus it is often not feasible to catch
14092 instantiator errors at instantiation time. Instead the instantiator is
14093 validated at the time it is added to the image specifier. This step
14094 is implemented by @code{image_validate}, which at a simple level validates
14095 keyword-value pairs.
14096 @item
14097 Duplication. The specifier code by default takes a copy of the
14098 instantiator. This is reasonable for most specifiers but in the case of
14099 widget-glyphs can be problematic, since some of the properties in the
14100 instantiator - for instance callbacks - could cause infinite recursion
14101 in the copying process. Thus the image code defines a function -
14102 @code{image_copy_instantiator} - which will selectively copy values.
14103 This is controlled by the way that a keyword is defined either using
14104 @code{IIFORMAT_VALID_KEYWORD} or
14105 @code{IIFORMAT_VALID_NONCOPY_KEYWORD}. Note that the image caching and
14106 redisplay code relies on instantiator copying to ensure that current and
14107 new instantiators are actually different rather than referring to the
14108 same thing.
14109 @item
14110 Normalization. Once the instantiator has been copied it must be
14111 converted into a form that is viable at instantiation time. This can
14112 involve no changes at all, but typically involves things like converting
14113 file names to the actual data. This step is implemented by
14114 @code{image_going_to_add} and @code{normalize_image_instantiator}.
14115 @item
14116 Instantiation. When an image instance is actually required for display
14117 it is instantiated using @code{image_instantiate}. This involves calling
14118 instantiate methods that are specific to the type of image being
14119 instantiated.
14120 @end itemize
14121
14122 The final instantiation phase also involves a number of steps. In order
14123 to understand these we need to describe a number of concepts.
14124
14125 An image is instantiated in a @dfn{domain}, where a domain can be any
14126 one of a device, frame, window or image-instance. The domain gives the
14127 image-instance context and identity and properties that affect the
14128 appearance of the image-instance may be different for the same glyph
14129 instantiated in different domains. An example is the face used to
14130 display the image-instance.
14131
14132 Although an image is instantiated in a particular domain the
14133 instantiation domain is not necessarily the domain in which the
14134 image-instance is cached. For example a pixmap can be instantiated in a
14135 window but actually be cached on a per-device basis. The domain in which
14136 the image-instance is actually cached is called the
14137 @dfn{governing-domain}. A governing-domain is currently either a device
14138 or a window. Widget-glyphs and text-glyphs have a window as a
14139 governing-domain; all other image-instances have a device as the
14140 governing-domain. The governing domain for an image-instance is
14141 determined using the @code{governing_domain} image-instance method.
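
The choice of governing domain described above can be summarized as a
small mapping. This sketch uses invented enumerations and an invented
function name; it is not the @code{governing_domain} method itself, and
the set of image kinds is deliberately cut down to three.

```c
/* Hypothetical, reduced set of image-instance kinds. */
enum image_kind_sketch
{
  IMAGE_KIND_TEXT,
  IMAGE_KIND_PIXMAP,
  IMAGE_KIND_WIDGET
};

/* The two governing domains currently in use. */
enum governing_domain_sketch
{
  GOVERNED_BY_DEVICE,
  GOVERNED_BY_WINDOW
};

/* Widget-glyphs and text-glyphs are cached per window; every other
   kind of image-instance is cached per device. */
enum governing_domain_sketch
governing_domain_for (enum image_kind_sketch kind)
{
  switch (kind)
    {
    case IMAGE_KIND_TEXT:
    case IMAGE_KIND_WIDGET:
      return GOVERNED_BY_WINDOW;
    default:
      return GOVERNED_BY_DEVICE;
    }
}
```

A cache lookup would then key on the governing domain rather than on the
domain in which instantiation was requested.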
14142
14143 @section Widget-Glyphs
14144 @cindex widget-glyphs
14145
14146 @section Widget-Glyphs in the MS-Windows Environment
14147 @cindex widget-glyphs in the MS-Windows environment
14148 @cindex MS-Windows environment, widget-glyphs in the
14149
14150 To Do
14151
14152 @section Widget-Glyphs in the X Environment
14153 @cindex widget-glyphs in the X environment
14154 @cindex X environment, widget-glyphs in the
14155
14156 Widget-glyphs under X make heavy use of lwlib (@pxref{Lucid Widget
14157 Library}) for manipulating the native toolkit objects. This is primarily
14158 so that different toolkits can be supported for widget-glyphs, just as
14159 they are supported for features such as menubars etc.
14160
14161 Lwlib is extremely poorly documented and quite hairy so here is my
14162 understanding of what goes on.
14163
14164 Lwlib maintains a set of widget_instances which mirror the hierarchical
14165 state of Xt widgets. I think this is so that widgets can be updated and
14166 manipulated generically by the lwlib library. For instance
14167 update_one_widget_instance can cope with multiple types of widget and
14168 multiple types of toolkit. Each element in the widget hierarchy is updated
14169 from its corresponding widget_instance by walking the widget_instance
14170 tree recursively.
14171
14172 This has desirable properties such as lw_modify_all_widgets which is
14173 called from @file{glyphs-x.c} and updates all the properties of a widget
14174 without having to know what the widget is or what toolkit it is from.
14175 Unfortunately this also has hairy properties such as making the lwlib
14176 code quite complex. And of course lwlib has to know at some level what
14177 the widget is and how to set its properties.
14178
14179 @node Specifiers, Menus, Glyphs, Top
14180 @chapter Specifiers
14181 @cindex specifiers
14182
14183 Not yet documented.
14184
14185 Specifiers are documented in depth in the Lisp Reference manual.
14186 @xref{Specifiers,,, lispref, XEmacs Lisp Reference Manual}. The code in
14187 @file{specifier.c} is pretty straightforward.
14188
14189 @node Menus, Subprocesses, Specifiers, Top
14190 @chapter Menus
14191 @cindex menus
14192
14193 A menu is set by setting the value of the variable
14194 @code{current-menubar} (which may be buffer-local) and then calling
14195 @code{set-menubar-dirty-flag} to signal a change. This will cause the
14196 menu to be redrawn at the next redisplay. The format of the data in
14197 @code{current-menubar} is described in @file{menubar.c}.
14198
14199 Internally the data in current-menubar is parsed into a tree of
14200 @code{widget_value's} (defined in @file{lwlib.h}); this is accomplished
14201 by the recursive function @code{menu_item_descriptor_to_widget_value()},
14202 called by @code{compute_menubar_data()}. Such a tree is deallocated
14203 using @code{free_widget_value()}.
14204
14205 @code{update_screen_menubars()} is one of the external entry points.
14206 This checks to see, for each screen, if that screen's menubar needs to
14207 be updated. This is the case if
14208
14209 @enumerate
14210 @item
14211 @code{set-menubar-dirty-flag} was called since the last redisplay. (This
14212 function sets the C variable menubar_has_changed.)
14213 @item
14214 The buffer displayed in the screen has changed.
14215 @item
14216 The screen has no menubar currently displayed.
14217 @end enumerate
14218
14219 @code{set_screen_menubar()} is called for each such screen. This
14220 function calls @code{compute_menubar_data()} to create the tree of
14221 widget_value's, then calls @code{lw_create_widget()},
14222 @code{lw_modify_all_widgets()}, and/or @code{lw_destroy_all_widgets()}
14223 to create the X-Toolkit widget associated with the menu.
14224
14225 @code{update_psheets()}, the other external entry point, actually
14226 changes the menus being displayed. It uses the widgets fixed by
14227 @code{update_screen_menubars()} and calls various X functions to ensure
14228 that the menus are displayed properly.
14229
14230 The menubar widget is set up so that @code{pre_activate_callback()} is
14231 called when the menu is first selected (i.e. mouse button goes down),
14232 and @code{menubar_selection_callback()} is called when an item is
14233 selected. @code{pre_activate_callback()} calls the function in
14234 activate-menubar-hook, which can change the menubar (this is described
14235 in @file{menubar.c}). If the menubar is changed,
14236 @code{set_screen_menubars()} is called.
14237 @code{menubar_selection_callback()} enqueues a menu event, putting in it
14238 a function to call (either @code{eval} or @code{call-interactively}) and
14239 its argument, which is the callback function or form given in the menu's
14240 description.
14241
14242 @node Subprocesses, Interface to MS Windows, Menus, Top
14243 @chapter Subprocesses 15076 @chapter Subprocesses
14244 @cindex subprocesses 15077 @cindex subprocesses
14245 15078
14246 The fields of a process are: 15079 The fields of a process are:
14247 15080
14848 Auto-generated Unicode encapsulation functions 15681 Auto-generated Unicode encapsulation functions
14849 @item intl-auto-encap-win32.h 15682 @item intl-auto-encap-win32.h
14850 Auto-generated Unicode encapsulation headers 15683 Auto-generated Unicode encapsulation headers
14851 @end table 15684 @end table
14852 15685
14853 @node Interface to the X Window System, Future Work, Interface to MS Windows, Top 15686 @node Interface to the X Window System, Dumping, Interface to MS Windows, Top
14854 @chapter Interface to the X Window System 15687 @chapter Interface to the X Window System
14855 @cindex X Window System, interface to the 15688 @cindex X Window System, interface to the
14856 15689
14857 Mostly undocumented. 15690 Mostly undocumented.
14858 15691
15144 @file{extw-*} is common code that is used for both the client and server. 15977 @file{extw-*} is common code that is used for both the client and server.
15145 15978
15146 Don't touch this code; something is liable to break if you do. 15979 Don't touch this code; something is liable to break if you do.
15147 15980
15148 15981
15149 @node Future Work, Future Work Discussion, Interface to the X Window System, Top 15982 @node Dumping, Future Work, Interface to the X Window System, Top
15983 @chapter Dumping
15984 @cindex dumping
15985
15986 @menu
15987 * Dumping Justification::
15988 * Overview::
15989 * Data descriptions::
15990 * Dumping phase::
15991 * Reloading phase::
15992 * Remaining issues::
15993 @end menu
15994
15995 @node Dumping Justification, Overview, Dumping, Dumping
15996 @section Dumping Justification
15997 @cindex dumping, justification
15998
15999 The C code of XEmacs is just a Lisp engine with a lot of built-in
16000 primitives useful for writing an editor. The editor itself is written
16001 mostly in Lisp, and represents around 100K lines of code. Loading and
16002 executing the initialization of all this code takes a bit of time (five
16003 to ten times the usual startup time of current xemacs) and requires
16004 having all the lisp source files around. Having to reload them each
16005 time the editor is started would not be acceptable.
16006
16007 The traditional solution to this problem is called dumping: the build
16008 process first creates the lisp engine under the name @file{temacs}, then
16009 runs it until it has finished loading and initializing all the lisp
16010 code, and eventually creates a new executable called @file{xemacs}
16011 including both the object code in @file{temacs} and all the contents of
16012 the memory after the initialization.
16013
16014 This solution, while working, has a huge problem: the creation of the
16015 new executable from the actual contents of memory is an extremely
16016 system-specific process, quite error-prone, and which interferes with a
16017 lot of system libraries (like malloc). It is even getting worse
16018 nowadays with libraries using constructors which are automatically
16019 called when the program is started (even before @code{main()}) which tend to
16020 crash when they are called multiple times, once before dumping and once
16021 after (IRIX 6.x @file{libz.so} pulls in some C++ image libraries thru
16022 dependencies which have this problem). Writing the dumper is also one
16023 of the most difficult parts of porting XEmacs to a new operating system.
16024 Basically, `dumping' is an operation that is just not officially
16025 supported on many operating systems.
16026
16027 The aim of the portable dumper is to solve the same problem as the
16028 system-specific dumper, that is to be able to reload quickly, using only
16029 a small number of files, the fully initialized lisp part of the editor,
16030 without any system-specific hacks.
16031
16032 @node Overview, Data descriptions, Dumping Justification, Dumping
16033 @section Overview
16034 @cindex dumping overview
16035
16036 The portable dumping system has to:
16037
16038 @enumerate
16039 @item
16040 At dump time, write all initialized, non-quickly-rebuildable data to a
16041 file [Note: currently named @file{xemacs.dmp}, but the name will
16042 change], along with all information needed for the reloading.
16043
16044 @item
16045 When starting xemacs, reload the dump file, relocate it to its new
16046 starting address if needed, and reinitialize all pointers to this
16047 data. Also, rebuild all the quickly rebuildable data.
16048 @end enumerate
16049
16050 Note: As of 21.5.18, the dump file has been moved inside of the
16051 executable, although there are still problems with this on some systems.
16052
16053 @node Data descriptions, Dumping phase, Overview, Dumping
16054 @section Data descriptions
16055 @cindex dumping data descriptions
16056
16057 The more complex task of the dumper is to be able to write memory blocks
16058 on the heap (lisp objects, i.e. lrecords, and C-allocated memory, such
16059 as structs and arrays) to disk and reload them at a different address,
16060 updating all the pointers they include in the process. This is done by
16061 using external data descriptions that give information about the layout
16062 of the blocks in memory.
16063
16064 The specification of these descriptions is in lrecord.h. A description
16065 of an lrecord is an array of struct memory_description. Each of these
16066 structs includes a type, an offset in the block and some optional
16067 parameters depending on the type. For instance, here is the string
16068 description:
16069
16070 @example
16071 static const struct memory_description string_description[] = @{
16072 @{ XD_BYTECOUNT, offsetof (Lisp_String, size) @},
16073 @{ XD_OPAQUE_DATA_PTR, offsetof (Lisp_String, data), XD_INDIRECT(0, 1) @},
16074 @{ XD_LISP_OBJECT, offsetof (Lisp_String, plist) @},
16075 @{ XD_END @}
16076 @};
16077 @end example
16078
16079 The first line indicates a member of type Bytecount, which is used by
16080 the next, indirect directive. The second means "there is a pointer to
16081 some opaque data in the field @code{data}". The length of said data is
16082 given by the expression @code{XD_INDIRECT(0, 1)}, which means "the value
16083 in the 0th line of the description (welcome to C) plus one". The third
16084 line means "there is a Lisp_Object member @code{plist} in the Lisp_String
16085 structure". @code{XD_END} then ends the description.
16086
16087 This gives us all the information we need to move around what is pointed
16088 to by a memory block (C or lrecord) and, by transitivity, everything
16089 that it points to. The only missing information for dumping is the size
16090 of the block. For lrecords, this is part of the
16091 lrecord_implementation, so we don't need to duplicate it. For C blocks
16092 we use a struct sized_memory_description, which includes a size field
16093 and a pointer to an associated array of memory_description.
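
As a rough illustration of how such a table can be consumed, here is a
minimal sketch in C. It does not use the real definitions from
@file{lrecord.h}: the @code{sketch_*} names, the reduced set of type codes
and the toy string struct are all assumptions made for this example. It
walks a description until the end marker and resolves the indirect length
of each opaque data pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real XD_* type codes. */
enum sketch_type { SK_END, SK_BYTECOUNT, SK_OPAQUE_DATA_PTR, SK_LISP_OBJECT };

struct sketch_description {
  enum sketch_type type;
  size_t offset;       /* offset of the member inside the block */
  int indirect_line;   /* SK_OPAQUE_DATA_PTR: description line holding the length */
  long indirect_add;   /* constant added to that length */
};

/* Toy block with the same shape as the string example above. */
struct sketch_string {
  long size;
  char *data;
  void *plist;
};

static const struct sketch_description sketch_string_description[] = {
  { SK_BYTECOUNT,       offsetof (struct sketch_string, size),  0, 0 },
  { SK_OPAQUE_DATA_PTR, offsetof (struct sketch_string, data),  0, 1 },
  { SK_LISP_OBJECT,     offsetof (struct sketch_string, plist), 0, 0 },
  { SK_END, 0, 0, 0 }
};

/* Read the count stored in the member described by line LINE of DESC. */
static long
sketch_indirect_count (const void *block,
                       const struct sketch_description *desc, int line)
{
  return *(const long *) ((const char *) block + desc[line].offset);
}

/* Return the total number of opaque bytes directly reachable from BLOCK. */
long
sketch_opaque_bytes (const void *block, const struct sketch_description *desc)
{
  long total = 0;
  int i;

  assert (block != NULL && desc != NULL);
  for (i = 0; desc[i].type != SK_END; i++)
    if (desc[i].type == SK_OPAQUE_DATA_PTR)
      total += sketch_indirect_count (block, desc, desc[i].indirect_line)
               + desc[i].indirect_add;
  return total;
}
```

The real dumper does much more per entry (it registers the pointed-to
data and recurses into Lisp objects); the sketch only shows how the
offsets and the indirect length are meant to be read.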
16094
16095 @node Dumping phase, Reloading phase, Data descriptions, Dumping
16096 @section Dumping phase
16097 @cindex dumping phase
16098
16099 Dumping is done by calling the function @code{pdump()} (in @file{dumper.c}) which is
16100 invoked from Fdump_emacs (in @file{emacs.c}). This function performs a number
16101 of tasks.
16102
16103 @menu
16104 * Object inventory::
16105 * Address allocation::
16106 * The header::
16107 * Data dumping::
16108 * Pointers dumping::
16109 @end menu
16110
16111 @node Object inventory, Address allocation, Dumping phase, Dumping phase
16112 @subsection Object inventory
16113 @cindex dumping object inventory
16114 @cindex memory blocks
16115
16116 The first task is to build the list of the objects to dump. This
16117 includes:
16118
16119 @itemize @bullet
16120 @item lisp objects
16121 @item other memory blocks (C structures, arrays, etc.)
16122 @end itemize
16123
16124 We end up with one @code{pdump_block_list_elt} per object group (arrays
16125 of C structs are kept together) which includes a pointer to the first
16126 object of the group, the per-object size and the count of objects in the
16127 group, along with some other information which is initialized later.
16128
16129 These entries are linked together in @code{pdump_block_list} structures
16130 and can be enumerated thru either:
16131
16132 @enumerate
16133 @item
16134 the @code{pdump_object_table}, an array of @code{pdump_block_list}, one
16135 per lrecord type, indexed by type number.
16136
16137 @item
16138 the @code{pdump_opaque_data_list}, used for the opaque data which does
16139 not include pointers, and hence does not need descriptions.
16140
16141 @item
16142 the @code{pdump_desc_table}, which is a vector of
16143 @code{memory_description}/@code{pdump_block_list} pairs, used for
16144 non-opaque C memory blocks.
16145 @end enumerate
16146
16147 This uses a marking strategy similar to the garbage collector. Some
16148 differences though:
16149
16150 @enumerate
16151 @item
16152 We do not use the mark bit (which does not exist for generic memory blocks
16153 anyway); we use a big hash table instead.
16154
16155 @item
16156 We do not use the mark function of lrecords but instead rely on the
16157 external descriptions. This happens essentially because we need to
16158 follow pointers to generic memory blocks and opaque data in addition to
16159 Lisp_Object members.
16160 @end enumerate
16161
16162 This is done by @code{pdump_register_object()}, which handles
16163 Lisp_Object variables, and @code{pdump_register_block()} which handles
16164 generic memory blocks (C structures, arrays, etc.), which both delegate
16165 the description management to @code{pdump_register_sub()}.
16166
16167 The hash table doubles as a map from object to pdump_block_list_elt (i.e.
16168 allows us to look up a pdump_block_list_elt with the object it points
16169 to). Entries are added with @code{pdump_add_block()} and looked up with
16170 @code{pdump_get_block()}. There is no need for entry removal. The hash
16171 value is computed quite simply from the object pointer by
16172 @code{pdump_make_hash()}.
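
The idea of an insert-only table keyed on object addresses can be sketched
as follows. This is not the code in @file{dumper.c}: the @code{sketch_*}
names, the table size, the hash formula and the use of linear probing are
assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_SIZE 1024   /* arbitrary size for the sketch */

struct sketch_entry {
  const void *obj;   /* key: address of the registered block */
  size_t size;       /* per-object size */
  int count;         /* number of objects in the group */
};

static struct sketch_entry *sketch_table[SKETCH_HASH_SIZE];

/* Hash an address; the low bits are dropped because aligned objects
   tend to have them all zero. */
size_t
sketch_make_hash (const void *obj)
{
  return (size_t) (((uintptr_t) obj >> 3) % SKETCH_HASH_SIZE);
}

/* Insert E, probing linearly.  Returns 0 on success, -1 if the table is full. */
int
sketch_add_block (struct sketch_entry *e)
{
  size_t pos, probes;

  assert (e != NULL && e->obj != NULL);
  pos = sketch_make_hash (e->obj);
  for (probes = 0; probes < SKETCH_HASH_SIZE; probes++)
    {
      if (sketch_table[pos] == NULL)
        {
          sketch_table[pos] = e;
          return 0;
        }
      pos = (pos + 1) % SKETCH_HASH_SIZE;
    }
  return -1;
}

/* Return the entry registered for OBJ, or NULL if OBJ was never added. */
struct sketch_entry *
sketch_get_block (const void *obj)
{
  size_t pos = sketch_make_hash (obj);
  size_t probes;

  for (probes = 0; probes < SKETCH_HASH_SIZE; probes++)
    {
      struct sketch_entry *e = sketch_table[pos];
      if (e == NULL)
        return NULL;
      if (e->obj == obj)
        return e;
      pos = (pos + 1) % SKETCH_HASH_SIZE;
    }
  return NULL;
}
```

Because entries are never removed, an empty slot reliably ends a probe
sequence, which is what makes the lookup this simple.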
16173
16174 The roots for the marking are:
16175
16176 @enumerate
16177 @item
16178 the @code{staticpro}'ed variables (there is a special
16179 @code{staticpro_nodump()} call for protected variables we do not want to
16180 dump).
16181
16182 @item
16183 the Lisp_Object variables registered via @code{dump_add_root_lisp_object}
16184 (@code{staticpro()} is equivalent to @code{staticpro_nodump()} +
16185 @code{dump_add_root_lisp_object()}).
16186
16187 @item
16188 the data-segment memory blocks registered via @code{dump_add_root_block}
16189 (for blocks with relocatable pointers), or @code{dump_add_opaque} (for
16190 "opaque" blocks with no relocatable pointers; this is just a shortcut
16191 for calling @code{dump_add_root_block} with a NULL description).
16192
16193 @item
16194 the pointer variables registered via @code{dump_add_root_block_ptr},
16195 each of which points to a block of heap memory (generally a C structure
16196 or array). Note that @code{dump_add_root_block_ptr} is not technically
16197 necessary, as a pointer variable can be seen as a special case of a
16198 data-segment memory block and registered using
16199 @code{dump_add_root_block}. Doing it this way, however, would require
16200 another level of static structures declared. Since pointer variables
16201 are quite common, @code{dump_add_root_block_ptr} is provided for
16202 convenience. Note also that internally we have to treat it separately
16203 from @code{dump_add_root_block} rather than writing the former as a call
16204 to the latter, since we don't have support for creating and using memory
16205 descriptions on the fly -- they must all be statically declared in the
16206 data-segment.
16207 @end enumerate
16208
16209 This does not include the GCPRO'ed variables, the specbinds, the
16210 catchtags, the backlist, the redisplay or the profiling info, since we
16211 do not want to rebuild the actual chain of lisp calls which lead up to
16212 the dump-emacs call, only the global variables.
16213
16214 Weak lists and weak hash tables are dumped as if they were their
16215 non-weak equivalent (without changing their type, of course). This has
16216 not yet been a problem.
16217
16218 @node Address allocation, The header, Object inventory, Dumping phase
16219 @subsection Address allocation
16220 @cindex dumping address allocation
16221
16222
16223 The next step is to allocate the offsets of each of the objects in the
16224 final dump file. This is done by @code{pdump_allocate_offset()} which
16225 is called indirectly by @code{pdump_scan_by_alignment()}.
16226
16227 The strategy to deal with alignment problems uses these facts:
16228
16229 @enumerate
16230 @item
16231 real world alignment requirements are powers of two.
16232
16233 @item
16234 the C compiler is required to adjust the size of a struct so that you
16235 can have an array of them next to each other. This means you can obtain an
16236 upper bound on the alignment requirement of a given structure by
16237 finding the largest power of two of which its size is a multiple.
16238
16239 @item
16240 the non-variant part of variable size lrecords has an alignment
16241 requirement of 4.
16242 @end enumerate
16243
16244 Hence, for each lrecord type, C struct type or opaque data block the
16245 alignment requirement is computed as a power of two, with a minimum of
16246 2^2 for lrecords. @code{pdump_scan_by_alignment()} then scans all the
16247 @code{pdump_block_list_elt}'s, the ones with the highest requirements
16248 first. This ensures the best packing.
16249
16250 The maximum alignment requirement we take into account is 2^8.
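
The size-based bound can be written in a few lines of C. The function
names and macros below are hypothetical, not those used in
@file{dumper.c}; the sketch only assumes the rules stated above (largest
power of two dividing the size, a cap of 2^8, and a floor of 2^2 for
lrecords).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ALIGN_BITS 8     /* cap: 2^8 */
#define SKETCH_MIN_LRECORD_BITS 2   /* floor for lrecords: 2^2 */

/* Return log2 of the largest power of two that divides SIZE,
   capped at SKETCH_MAX_ALIGN_BITS. */
int
sketch_size_to_align_bits (size_t size)
{
  int bits = 0;

  assert (size != 0);
  while (bits < SKETCH_MAX_ALIGN_BITS
         && (size & ((size_t) 1 << bits)) == 0)
    bits++;
  return bits;
}

/* Same as above, but never below the minimum assumed for lrecords. */
int
sketch_lrecord_align_bits (size_t size)
{
  int bits = sketch_size_to_align_bits (size);
  return bits < SKETCH_MIN_LRECORD_BITS ? SKETCH_MIN_LRECORD_BITS : bits;
}
```

For example, a 24-byte struct yields 3 (alignment 8), since 24 is a
multiple of 8 but not of 16.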
16251
16252 @code{pdump_allocate_offset()} only has to do a linear allocation,
16253 starting at offset 256 (this leaves room for the header and keeps the
16254 alignments happy).
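
A minimal sketch of such a linear allocation, assuming a hypothetical
@code{sketch_group} struct in place of the real list elements, could look
like this. Groups are visited from the strictest alignment to the loosest;
the explicit rounding is only a safety net, since sizes that are multiples
of their own alignment never need padding in this order.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BASE_OFFSET 256   /* first offset after the header */
#define SKETCH_MAX_ALIGN_BITS 8

struct sketch_group {
  size_t size;          /* per-object size */
  size_t count;         /* number of objects in the group */
  int align_bits;       /* log2 of the alignment requirement */
  size_t save_offset;   /* filled in by the allocator */
};

/* Assign a file offset to each of the N groups, strictest alignment
   first, and return the offset just past the last allocated byte. */
size_t
sketch_allocate_offsets (struct sketch_group *groups, size_t n)
{
  size_t cur = SKETCH_BASE_OFFSET;
  size_t i;
  int bits;

  assert (groups != NULL || n == 0);
  for (bits = SKETCH_MAX_ALIGN_BITS; bits >= 0; bits--)
    for (i = 0; i < n; i++)
      if (groups[i].align_bits == bits)
        {
          size_t align = (size_t) 1 << bits;
          cur = (cur + align - 1) & ~(align - 1);   /* round up to ALIGN */
          groups[i].save_offset = cur;
          cur += groups[i].size * groups[i].count;
        }
  return cur;
}
```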
16255
16256 @node The header, Data dumping, Address allocation, Dumping phase
16257 @subsection The header
16258 @cindex dumping, the header
16259
16260 The next step creates the file and writes a header with a signature and
16261 some random information in it. The @code{reloc_address} field, which
16262 indicates at which address the file should be loaded if we want to avoid
16263 post-reload relocation, is set to 0. It then seeks to offset 256 (base
16264 offset for the objects).
16265
16266 @node Data dumping, Pointers dumping, The header, Dumping phase
16267 @subsection Data dumping
16268 @cindex data dumping
16269 @cindex dumping, data
16270
16271 The data is dumped in the same order as the addresses were allocated by
16272 @code{pdump_dump_data()}, called from @code{pdump_scan_by_alignment()}.
16273 This function copies the data to a temporary buffer, relocates all
16274 pointers in the object to the addresses allocated in step Address
16275 Allocation, and writes it to the file. Using the same order means that,
16276 if we are careful with lrecords whose size is not a multiple of 4, we
16277 are assured that the object is always written at the offset in the file
16278 allocated in step Address Allocation.
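
The copy-then-relocate step can be illustrated with the following sketch.
It is deliberately simplified and is not the real @code{pdump_dump_data()}:
pointer members are given as a plain list of offsets instead of a memory
description, the address-to-offset map is a linear array rather than the
hash table, and pointers are assumed to have the same size as
@code{uintptr_t}. All @code{sketch_*} names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One (address, allocated file offset) pair. */
struct sketch_map { const void *addr; uintptr_t offset; };

/* Return the file offset allocated to ADDR, or 0 if it is unknown. */
static uintptr_t
sketch_lookup (const struct sketch_map *map, size_t n, const void *addr)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (map[i].addr == addr)
      return map[i].offset;
  return 0;
}

/* Copy SIZE bytes of OBJ into BUF, then overwrite every pointer member
   (located at PTR_OFFSETS[0..N_PTRS-1]) with the file offset of its
   target.  Null pointers stay null.  OBJ itself is left untouched;
   BUF is what would be written to the dump file. */
void
sketch_prepare_for_dump (void *buf, const void *obj, size_t size,
                         const size_t *ptr_offsets, size_t n_ptrs,
                         const struct sketch_map *map, size_t n_map)
{
  size_t i;

  assert (buf != NULL && obj != NULL);
  memcpy (buf, obj, size);
  for (i = 0; i < n_ptrs; i++)
    {
      void *target;
      uintptr_t off;

      memcpy (&target, (char *) buf + ptr_offsets[i], sizeof target);
      off = target ? sketch_lookup (map, n_map, target) : 0;
      memcpy ((char *) buf + ptr_offsets[i], &off, sizeof off);
    }
}
```

Working on a temporary copy matters: the live objects must keep their
real pointers, because the running process continues to use them during
and after the dump.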
16279
16280 @node Pointers dumping, , Data dumping, Dumping phase
16281 @subsection Pointers dumping
16282 @cindex pointers dumping
16283 @cindex dumping, pointers
16284
16285 A bunch of tables needed to properly reassign the global pointers are
16286 then written. They are:
16287
16288 @enumerate
16289 @item
16290 the pdump_root_block_ptrs dynarr
16291 @item
16292 the pdump_opaques dynarr
16293 @item
16294 a vector of all the offsets to the objects in the file that include a
16295 description (for faster relocation at reload time)
16296 @item
16297 the pdump_root_objects and pdump_weak_object_chains dynarrs.
16298 @end enumerate
16299
16300 For each of the dynarrs we write both the pointer to the variables and
16301 the relocated offset of the object they point to. Since these variables
16302 are global, the pointers are still valid when restarting the program and
16303 are used to regenerate the global pointers.
16304
16305 The @code{pdump_weak_object_chains} dynarr is a special case. The
16306 variables it points to are the heads of weak linked lists of lisp objects
16307 of the same type. Not all objects of this list are dumped so the
16308 relocated pointer we associate with them points to the first dumped
16309 object of the list, or Qnil if none is available. This is also the
16310 reason why they are not used as roots for the purpose of object
16311 enumeration.
16312
16313 Some very important information like the @code{staticpros} and
16314 @code{lrecord_implementations_table} are handled indirectly using
16315 @code{dump_add_opaque} or @code{dump_add_root_block_ptr}.
16316
16317 This is the end of the dumping part.
16318
16319 @node Reloading phase, Remaining issues, Dumping phase, Dumping
16320 @section Reloading phase
16321 @cindex reloading phase
16322 @cindex dumping, reloading phase
16323
16324 @subsection File loading
16325 @cindex dumping, file loading
16326
16327 The file is mmap'ed in memory (which ensures a PAGESIZE alignment, at
16328 least 4096), or if mmap is unavailable or fails, a 256-byte-aligned
16329 malloc is done and the file is loaded.
16330
16331 Some variables are reinitialized from the values found in the header.
16332
16333 The difference between the actual loading address and the reloc_address
16334 is computed and will be used for all the relocations.
16335
16336
16337 @subsection Putting back the pdump_opaques
16338 @cindex dumping, putting back the pdump_opaques
16339
16340 The memory contents are restored in the obvious and trivial way.
16341
16342
16343 @subsection Putting back the pdump_root_block_ptrs
16344 @cindex dumping, putting back the pdump_root_block_ptrs
16345
16346 The variables pointed to by pdump_root_block_ptrs in the dump phase are
16347 reset to the right relocated object addresses.
16348
16349
16350 @subsection Object relocation
16351 @cindex dumping, object relocation
16352
16353 All the objects are relocated using their description and their offset
16354 by @code{pdump_reloc_one}. This step is unnecessary if the
16355 reloc_address is equal to the file loading address.
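
In outline, the relocation amounts to adding one fixed delta to every
stored pointer. The sketch below assumes a flat list of pointer offsets
instead of the real per-object descriptions, and assumes pointers are the
size of @code{uintptr_t}; @code{sketch_relocate} is a hypothetical name,
not a function in @file{dumper.c}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* IMAGE is the dump file as loaded in memory.  Pointers inside it were
   written as if the file had been loaded at RELOC_ADDRESS.  Adjust each
   pointer found at PTR_OFFSETS[0..N-1] by the difference between the
   actual load address and RELOC_ADDRESS.  Null pointers are kept. */
void
sketch_relocate (char *image, uintptr_t reloc_address,
                 const size_t *ptr_offsets, size_t n)
{
  uintptr_t delta;
  size_t i;

  assert (image != NULL);
  /* Unsigned arithmetic wraps, so this works in both directions. */
  delta = (uintptr_t) image - reloc_address;
  if (delta == 0)
    return;   /* loaded at the expected address: nothing to do */

  for (i = 0; i < n; i++)
    {
      uintptr_t p;
      memcpy (&p, image + ptr_offsets[i], sizeof p);
      if (p != 0)
        p += delta;
      memcpy (image + ptr_offsets[i], &p, sizeof p);
    }
}
```

With a @code{reloc_address} of 0, the stored values are plain file
offsets and the delta is simply the load address.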
16356
16357
16358 @subsection Putting back the pdump_root_objects and pdump_weak_object_chains
16359 @cindex dumping, putting back the pdump_root_objects and pdump_weak_object_chains
16360
16361 Same as Putting back the pdump_root_block_ptrs.
16362
16363
16364 @subsection Reorganize the hash tables
16365 @cindex dumping, reorganize the hash tables
16366
16367 Since some of the hash values in the lisp hash tables are
16368 address-dependent, their layout is now wrong. So we go through each of
16369 them and have them re-sorted by calling @code{pdump_reorganize_hash_table}.
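
To see why this is needed, consider a table whose hash is derived from
the key's address: once the keys have moved, each entry sits in a slot
that no longer matches its hash. The following sketch rebuilds such a
table. It uses a toy open-addressing layout and hypothetical
@code{sketch_*} names; the real Lisp hash table structure is different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_slot { const void *key; int value; };   /* key == NULL: empty */

/* Address-dependent hash, as for eq-style tables. */
size_t
sketch_hash (const void *key, size_t size)
{
  return (size_t) (((uintptr_t) key >> 3) % size);
}

/* Insert ITEM with linear probing; SLOTS must have a free slot. */
static void
sketch_insert (struct sketch_slot *slots, size_t size, struct sketch_slot item)
{
  size_t pos = sketch_hash (item.key, size);
  while (slots[pos].key != NULL)
    pos = (pos + 1) % size;
  slots[pos] = item;
}

/* Re-insert every entry of SLOTS according to the current key
   addresses.  Returns 0 on success, -1 if memory is exhausted. */
int
sketch_reorganize (struct sketch_slot *slots, size_t size)
{
  struct sketch_slot *fresh;
  size_t i;

  assert (slots != NULL && size > 0);
  fresh = malloc (size * sizeof *fresh);
  if (fresh == NULL)
    return -1;
  for (i = 0; i < size; i++)
    {
      fresh[i].key = NULL;
      fresh[i].value = 0;
    }
  for (i = 0; i < size; i++)
    if (slots[i].key != NULL)
      sketch_insert (fresh, size, slots[i]);
  for (i = 0; i < size; i++)
    slots[i] = fresh[i];
  free (fresh);
  return 0;
}

/* Look KEY up; returns NULL if it cannot be reached from its hash slot. */
const struct sketch_slot *
sketch_find (const struct sketch_slot *slots, size_t size, const void *key)
{
  size_t pos = sketch_hash (key, size);
  size_t probes;

  for (probes = 0; probes < size; probes++)
    {
      if (slots[pos].key == NULL)
        return NULL;
      if (slots[pos].key == key)
        return &slots[pos];
      pos = (pos + 1) % size;
    }
  return NULL;
}
```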
16370
16371 @node Remaining issues, , Reloading phase, Dumping
16372 @section Remaining issues
16373 @cindex dumping, remaining issues
16374
16375 The build process will have to start a post-dump xemacs, ask it the
16376 loading address (which will, hopefully, always be the same between
16377 different xemacs invocations) [[unfortunately, not true on Linux with
16378 the ExecShield feature]] and relocate the file to the new address.
16379 This way the object relocation phase will not have to be done, which
16380 means no writes in the objects and that, because of the use of mmap, the
16381 dumped data will be shared between all the xemacs running on the
16382 computer.
16383
16384 Some executable signature will be necessary to ensure that a given dump
16385 file is really associated with a given executable, or random crashes
16386 will occur. Maybe a random number set at compile or configure time thru
16387 a define. This will also allow for having differently-compiled xemacsen
16388 on the same system (mule and no-mule come to mind).
16389
16390 The DOC file contents should probably end up in the dump file.
16391
16392
16393 @node Future Work, Future Work Discussion, Dumping, Top
15150 @chapter Future Work 16394 @chapter Future Work
15151 @cindex future work 16395 @cindex future work
15152 16396
15153 @menu 16397 @menu
16398 * Future Work -- General Suggestions::
15154 * Future Work -- Elisp Compatibility Package:: 16399 * Future Work -- Elisp Compatibility Package::
15155 * Future Work -- Drag-n-Drop:: 16400 * Future Work -- Drag-n-Drop::
15156 * Future Work -- Standard Interface for Enabling Extensions:: 16401 * Future Work -- Standard Interface for Enabling Extensions::
15157 * Future Work -- Better Initialization File Scheme:: 16402 * Future Work -- Better Initialization File Scheme::
15158 * Future Work -- Keyword Parameters:: 16403 * Future Work -- Keyword Parameters::
15173 * Future Work -- Display Tables:: 16418 * Future Work -- Display Tables::
15174 * Future Work -- Making Elisp Function Calls Faster:: 16419 * Future Work -- Making Elisp Function Calls Faster::
15175 * Future Work -- Lisp Engine Replacement:: 16420 * Future Work -- Lisp Engine Replacement::
15176 @end menu 16421 @end menu
15177 16422
15178 @ignore 16423 @node Future Work -- General Suggestions, Future Work -- Elisp Compatibility Package, Future Work, Future Work
15179 Macro to convert a single line containing a heading into the format of 16424 @section Future Work -- General Suggestions
15180 all headings in the Future Work section. 16425 @cindex future work, general suggestions
15181 16426 @cindex general suggestions, future work
15182 (setq last-kbd-macro (read-kbd-macro 16427
15183 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Future SPC Work SPC - - SPC <home> <down> <C-right> <right> Future SPC Work SPC - - SPC <end> RET @cindex SPC future SPC work, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC future SPC work RET")) 16428 @subheading Jamie Zawinski's XEmacs Wishlist
15184 @end ignore 16429
15185 16430 This document is based on Jamie Zawinski's
15186 @node Future Work -- Elisp Compatibility Package, Future Work -- Drag-n-Drop, Future Work, Future Work 16431 @uref{http://www.jwz.org/doc/xemacs-wishlist.html,xemacs wishlist}.
16432 Throughout this page, ``I'' refers to Jamie.
16433
16434 The list has been substantially reformatted and edited to fit the needs
16435 of this site. If you have any soul at all, you'll go check out the
16436 original. OK? You should also check out some other
16437 @uref{http://www.xemacs.org/Releases/Public-21.2/execution.html#wishlists,wishlists}.
16438
16439
16440 @subsubheading About the List
16441
16442 I've ranked these (roughly) from easiest to hardest; though of all of
16443 them, I think the debugger improvements would be the most useful. I think
16444 the combination of emacs+gdb is the best Unix development environment
16445 currently available, but it's still lamentably primitive and extremely
16446 frustrating (much like Unix itself), especially if you know what kinds of
16447 features more modern integrated debuggers have.
16448
16449 @subsubheading XEmacs Wishlist
16450
16451 @table @strong
16452 @item Improve the keyboard macro system.
16453
16454 Keyboard macros are one of the most useful concepts that emacs has to
16455 offer, but there's room for improvement.
16456
16457 @table @strong
16458 @item Make it possible to embed one macro inside of another.
16459
16460 Often, I'll define a keyboard macro, and then realize that I've
16461 left something out, or that there's more that I need to do; for
16462 example, I may define a macro that does something to the current line,
16463 and then realize that I want to apply it to a lot of lines. So, I'd
16464 like this to work:
16465
16466 @example
16467 @kbd{C-x ( }
16468 ; start macro #1
16469 @kbd{... }
16470 ; (do stuff)
16471 @kbd{C-x ) }
16472 ; done with macro #1
16473 @kbd{... }
16474 ; (do stuff)
16475 @kbd{C-x ( }
16476 ; start macro #2
16477 @kbd{C-x e }
16478 ; execute macro #1 (splice it into macro #2)
16479 @kbd{C-s foo }
16480 ; move forward to the next spot
16481 @kbd{C-x ) }
16482 ; done with macro #2
16483 @kbd{C-u 1000 C-x e }
16484 ; apply the new macro
16485 @end example
16486
16487 That is, simply, one should be able to wrap new text around an
16488 existing macro. I can't tell you how many times I've defined a complex
16489 macro but left out the ``@kbd{C-n C-a}'' at the end...
16490
16491 Yes, you can accomplish this with M-x name-last-kbd-macro, but
16492 that's a pain. And it's also more permanent than I'd often like.
16493 @item Make it possible to correct errors when defining a macro.
16494
16495 Right now, the act of defining a macro stops if you get an error
16496 while defining it, and all of the characters you've already typed into
16497 the macro are gone. It needn't be that way. I think that, when that
16498 first error occurs, the user should be given the option of taking the
16499 last command off of the macro and trying again.
16500
16501 The macro-reader knows where the bounds of multi-character command
16502 sequences are, and it could even keep track of the corresponding undo
16503 records; rubbing out the previous entry on the macro could also undo
16504 any changes that command had made. (This should also work if the macro
16505 spans multiple buffers, and should restore window configurations as
16506 well.)
16507
16508 You'd want multi-level undo for this as well, so maybe the way to
16509 go would be to add some new key sequence which was used only as the
16510 back-up-inside-a-keyboard-macro-definition command.
16511
16512 I'm not totally sure that this would end up being very usable;
16513 maybe it would be too hard to deal with. Which brings us to:
16514 @item Make it possible to edit a keyboard macro after it has been defined.
16515
16516 I only just discovered @code{edit-kbd-macro} (@kbd{C-x C-k}).
16517 It is very, very cool.
16518
16519 The trick it does of showing the command which will be executed is
16520 somewhat error-prone, as it can only look up things in the current map
16521 or the global map; if the macro changed buffers, it wouldn't be
16522 displaying the right commands. (One of the things I often use macros
16523 for is operating on many files at once, by bringing up a dired buffer
16524 of those files, editing them, and then moving on to the next.)
16525
16526 However, if the act of recording a macro also kept track of the
16527 actual commands that had gotten executed, it could make use of that
16528 info as well.
16529
16530 Another way of editing a macro, other than as text in a buffer,
16531 would be to have a command which single-steps a macro: you would lean
16532 on the space bar to watch the macro execute one character (command?)
16533 at a time, and then when you reached the point you wanted to change,
16534 you could do some gesture to either: insert some keystrokes into the
16535 middle of the macro and then continue; or to replace the rest of the
16536 macro from here to the end; or something.
16537
16538 Another similar hack might be to convert a macro to the equivalent
16539 lisp code, so that one could tweak it later in ways that would be too
16540 hard to do from the keyboard (wrapping parts of it in @code{while} loops or
16541 something.) (@kbd{M-x insert-kbd-macro} isn't really what I'm
16542 talking about here: I mean insert the list of commands, not the list
16543 of keystrokes.)
16544 @end table
16545
16546 @item Save my wrists!
16547
16548 In the spirit of the `@code{teach-extended-commands-p}' variable,
16549 it would be interesting if emacs would keep track of what are the
16550 commands I use most often, perhaps grouped by proximity or mode -- it
16551 would then be more obvious which commands were most likely candidates
16552 for placement on a toolbar, or popup menu, or just a more convenient key
16553 binding.
16554
16555 Bonus points if it figures out that I type ``@kbd{bt\n}'' and
16556 ``@kbd{ret\ny\n}'' into my @samp{*gdb*} buffer about a hundred
16557 thousand times a day.
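
A minimal sketch of the bookkeeping side of this idea, assuming the
standard @code{post-command-hook} and @code{this-command} variables.
The names @code{my-command-counts}, @code{my-count-command} and
@code{my-most-used-commands} are hypothetical, and grouping by mode or
proximity is left out.

@example
(defvar my-command-counts (make-hash-table :test 'eq)
  "Hypothetical table mapping command symbols to use counts.")

(defun my-count-command ()
  "Increment the use count of the command that just ran."
  (when (and this-command (symbolp this-command))
    (puthash this-command
             (1+ (gethash this-command my-command-counts 0))
             my-command-counts)))

(add-hook 'post-command-hook 'my-count-command)

(defun my-most-used-commands (n)
  "Return up to N (COMMAND . COUNT) pairs, most used first."
  (let ((pairs nil)
        (result nil))
    (maphash (lambda (cmd count)
               (setq pairs (cons (cons cmd count) pairs)))
             my-command-counts)
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    (while (and pairs (> n 0))
      (setq result (cons (car pairs) result)
            pairs (cdr pairs)
            n (1- n)))
    (nreverse result)))
@end example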
16558 @item XmCreateFileSelectionBox
16559
16560 The thing that ``File/Open...'' pops up has excellent @emph{hack}
16561 value, but as a user interface, it's an abomination. Isn't it time
16562 someone added a real file selection dialog already? (For the
16563 Motifly-challenged, the Athena-based file selector that GhostView uses
16564 seems adequate.)
16565 @item Improve the toolbar system.
16566
16567 It's great that XEmacs has a toolbar, but it's damn near impossible
16568 to customize it.
16569
16570 @table @strong
16571 @item Make it easy to define new toolbar buttons.
16572
16573 Currently, to define a toolbar button that has a text equivalent,
16574 one must edit a pixmap, and put the text there! That's prohibitive.
16575 One should be able to add some kind of generic toolbar button, with a
16576 plain icon or none at all, but which has a text label, without having
16577 to use a paint program.
16578 @item Make it easy to have customized, mode-local toolbars.
16579
16580 In my @code{c-mode-hook}, for example, I can add a couple of new
16581 keybindings, and delete a few others, and to do that, I don't have to
16582 duplicate the entire definition of the @code{c-mode-map}. Making
16583 mode-local additions and subtractions to the toolbars should be as
16584 easy.
16585 @item Make it easy to have customized, mode-local popup menus.
16586
16587 The same situation holds for the right-mouse-button popup menu; one
16588 should be able to add new commands to those menus without difficulty.
16589 One problem is that each mode which does have a popup menu implements
16590 it in a different way...
16591 @end table
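
For comparison, this is roughly what the keybinding case mentioned
above looks like today: a few lines in a mode hook, with no need to
restate the whole keymap. The particular keys chosen are only
illustrative. A mode-local toolbar or popup menu interface could aim
for the same shape.

@example
(add-hook 'c-mode-hook
          (lambda ()
            ;; Add one binding and remove another in the mode's
            ;; local map, without redefining c-mode-map itself.
            (local-set-key "\C-c\C-v" 'compile)
            (local-unset-key "\C-c\C-a")))
@end example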
16592
16593 @item Make the External Widget work.
16594
16595 About half of the work is done to make a replacement for the
16596 @code{XmText} widget which offloads editing responsibility to an
16597 external Emacs process. Someone should finish that. The benefit here
16598 would be that then, any Motif program could be linked such that all
16599 editing happened with a real Emacs behind it. (If you're Athena-minded,
16600 flavor with @code{Text} instead of @code{XmText} -- it's probably
16601 easy to make it work with both.)
16602
16603 The part of this that is done already is the ability to run an Emacs
16604 screen on a Window object that has been created by another process (this
16605 is what the @file{ExternalClient.c} and @file{ExternalShell.c} stuff
16606 is.) What is left to be done is adding the text-widget-editor aspects
16607 of this.
16608
16609 First, the emacs screen being displayed on that window would have to
16610 be one without a modeline, and one which behaved sensibly in the context
16611 of ``I am a small multi-line text area embedded in a dialog box'' as
16612 opposed to ``I am a full-on text editor and lord of all that I survey.''
16613
16614 Second, the API that the (non-emacs-aware) user of the
16615 @code{XmText} widget expects would need to be implemented: give the
16616 caller the ability to pull the edited text string back out, and so on.
16617 The idea here being, hooking up emacs as the widget editor should be as
16618 transparent as possible.
16619 @item Bring the debugger interface into the eighties.
16620
16621 Some of you may have seen my @file{gdb-highlight.el}
16622 package, that I posted to gnu.emacs.sources last month. I think
16623 it's really cool, but there should be a lot more work in that direction.
16624 For those of you who haven't seen it, what it does is watch text that
16625 gets inserted into the @samp{*gdb*} buffer and make very nearly
16626 everything be clickable and have a context-sensitive menu. Generally,
16627 the types that are noticed are:
16628
16629 @itemize
16630 @item function names;
16631 @item variable and parameter names;
16632 @item structure slots;
16633 @item source file names;
16634 @item type names;
16635 @item breakpoint numbers;
16636 @item stack frame numbers.
16637 @end itemize
16638
16639 Any time one of those objects is presented in the @samp{*gdb*}
16640 buffer, it is mousable. Clicking middle button on it takes some default
16641 action (edits the function, selects the stack frame, disables the
16642 breakpoint, ...) Clicking the right button pops up a menu of commands,
16643 including commands specific to the object under the mouse, and/or other
16644 objects on the same line.
16645
16646 So that's all well and good, and I get far more joy out of what this
16647 code does for me than I expected, but there are still a bunch of
16648 limitations. The debugger interface needs to do much, much more.
16649
16650 @table @strong
16651 @item Make gdbsrc-mode not suck.
16652
16653 The idea behind @code{gdbsrc-mode} is on the side of the angels:
16654 one should be able to focus on the source code and not on the debugger
16655 buffer, absolutely. But the implementation is just awful.
16656
16657 First and foremost, it should not change ``modes'' (in the more
16658 general sense). Any commands that it defines should be on keys which
16659 are exclusively used for that purpose, not keys which are normally
16660 self-inserting. I can't be the only person who usually has occasion to
16661 actually @emph{edit} the sources which the debugger has chosen to
16662 display! Switching into and out of @code{gdbsrc-mode} is
16663 prohibitive.
16664
16665 I want to be looking at my sources at all times, yet I don't want
16666 to have to give up my source-editing gestures. I think the right way
16667 to accomplish this is to put the gdbsrc commands on the toolbar and on
16668 popup menus; or to let the user define their own keys (I could see
16669 devoting my @key{kp_enter} key to ``step'', or something common
16670 like that.)
16671
16672 Also it's extremely frustrating that one can't turn off gdbsrc mode
16673 once it has been loaded, without exiting and restarting emacs; that
16674 alone means that I'd probably never take the time to learn how to use
16675 it, without first having taken the time to repair it...
16676 @item Make it easier to access variable values.
16677
16678 I want to be able to double-click on a variable name to highlight
16679 it, and then drag it to the debugger window to have its value printed.
16680
16681 I want gestures that let me write as well as read: for example, to
16682 store value A into slot B.
16683 @item Make all breakpoints visible.
16684
16685 Any time there is a running gdb which has breakpoints, the buffers
16686 holding the lines on which those breakpoints are set should have icons
16687 in them. These icons should be context-sensitive: I should be able to
16688 pop up a menu to enable or disable them, to delete them, to change
16689 their commands or conditions.
16690
16691 I should also be able to @emph{move} them. It's
16692 annoying when you have a breakpoint with a complex condition or
16693 command on it, and then you realize that you really want it to be at a
16694 different location. I want to be able to drag-and-drop the icon to its
16695 new home.
16696 @item Make a debugger status display window.
16697
16698 @itemize
16699 @item
16700
16701 I want a window off to the side that shows persistent information
16702 -- it should have a pane which is a drag-editable, drag-reorderable
16703 representation of the elements on gdb's ``display'' list; they
16704 should be displayed here instead of being just dumped in with the
16705 rest of the output in the @samp{*gdb*} buffer.
16706 @item
16707
16708 I want a pane that displays the current call-stack and nothing
16709 else. I want a pane that displays the arguments and locals of the
16710 currently-selected frame and nothing else. I want these both to
16711 update as I move around on the stack.
16712 @item
16713
16714 Since the unfortunate reality is that excavating this information
16715 from gdb can be slow, it would be a good idea for these panes to
16716 have a toggle button on them which meant ``stop updating'', so that
16717 when I want to move fast, I can, but I can easily get the display
16718 back when I need it again.
16719 @end itemize
16720
16721 The reason for all of this is that I spend entirely too much time
16722 scrolling around in the @samp{*gdb*} buffer; with gdb-highlight, I
16723 can just click on a line in the backtrace output to go to that frame,
16724 but I find that I spend a lot of time @emph{looking} for that
16725 backtrace: since it's mixed in with all the other random output, I
16726 waste time looking around for things (and usually just give up and
16727 type ``@kbd{bt}'' again, then thrash around as the buffer scrolls,
16728 and I try to find the lower frames that I'm interested in, as they
16729 have invariably scrolled off the window already...)
16730 @item Save and restore breakpoints across emacs/debugger sessions.
16731
16732 This would be especially handy given that gdb leaks like a sieve,
16733 and with a big program, I only get a few dozen relink-and-rerun
16734 attempts before gdb has blown my swap space.
16735 @item Keep breakpoints in sync with source lines.
16736
16737 When a program is recompiled and then reloaded into gdb, the
16738 breakpoints often end up in less-than-useful places. For example, when
16739 I edit text which occurs in a file anywhere before a breakpoint, emacs
16740 is aware that the line of the bp hasn't changed, but just that it is
16741 in a different place relative to the top of the file. Gdb doesn't know
16742 this, so your breakpoints end up getting set in the wrong places
16743 (usually the maximally inconvenient places, like @emph{after} a loop
16744 instead of @emph{inside} it). But emacs knows, so emacs should
16745 inform the debugger, and move the breakpoints back to the places they
16746 were intended to be.
16747 @end table
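
The breakpoint-tracking idea in the last item can be sketched with
ordinary markers, which move along with the text as the buffer is
edited. The function names @code{my-bp-remember} and
@code{my-bp-current-line} are hypothetical, and the step of actually
telling gdb about the new line is not shown.

@example
(defun my-bp-remember (line)
  "Return a marker at the start of LINE in the current buffer."
  (save-excursion
    (save-restriction
      (widen)
      (goto-char (point-min))
      (forward-line (1- line))
      (point-marker))))

(defun my-bp-current-line (marker)
  "Return the line number MARKER is on now, after any edits."
  (with-current-buffer (marker-buffer marker)
    (save-excursion
      (save-restriction
        (widen)
        (goto-char marker)
        (beginning-of-line)
        (1+ (count-lines (point-min) (point)))))))
@end example

Before the program is reloaded, the debugger interface could ask each
marker for its current line and re-set the breakpoint there.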
16748
16749 (Possibly the OOBR stuff does some of this, but I can't tell, because
16750 I've never been able to get it to do anything but beep at me and mumble
16751 about environments. I find it pretty funny that the manual keeps
16752 explaining to me how intuitive it is, without actually giving me a clue
16753 how to launch it...)
16754 @item Add better dialog box features.
16755
16756 It'd be nice to be able to create more complex dialog boxes from
16757 emacs-lisp: ones with checkboxes, radio button groups, text fields, and
16758 popup menus.
16759 @item Add embeddable dialog boxes.
16760
16761 One of the things that the now-defunct Energize code (the C side of
16762 it, that is) could do was embed a dialog box between the toolbar and the
16763 main text area -- buffers could have control panels associated with
16764 them, that had all kinds of complex behavior.
16765 @item Make the mark-stack be visible.
16766
16767 You know, I've encountered people who have been using emacs for
16768 years, and never use the mark stack for navigation. I can't live without
16769 it; ``@kbd{C-u C-SPC}'' is among my most common gestures.
16770
16771 @enumerate
16772 @item
16773
16774 It would be a lot easier to realize what's going to happen if the
16775 marks on the mark stack were visible. They could be displayed as small
16776 ``caret'' glyphs, for example; something large enough to be visible,
16777 but not easily mistaken for a character or for the cursor.
16778 @item
16779
16780 The marks and the selected region should be visible in the
16781 scrollbar as well -- I don't remember where I first saw this idea, but
16782 it's very cool: there's a second, less-strongly-rendered ``thumb'' in
16783 the scrollbar which indicates the position and size of the selection;
16784 and there are tiny tick-marks which indicate the positions of the
16785 saved points.
16786 @item
16787
16788 Markers which are in registers (@code{point-to-register}, @kbd{C-x
16789 /}) should be displayed differently (more prominently.)
16790 @item
16791
16792 It'd be cool if you could pick up markers and move them around, to
16793 adjust the points you'll be coming back to later.
16794 @end enumerate
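
A rough sketch of the first point, assuming the XEmacs extent and glyph
functions @code{make-extent}, @code{make-glyph} and
@code{set-extent-begin-glyph}. The function name
@code{my-show-mark-ring} is hypothetical, the plain @samp{^} glyph is
only a placeholder for a proper caret, and the scrollbar and dragging
parts are not attempted.

@example
(defun my-show-mark-ring ()
  "Attach a caret glyph at each mark-ring position in the current buffer.
Return the list of extents created, so that the caller can later
remove them with `delete-extent'."
  (let ((caret (make-glyph "^"))
        (ring mark-ring)
        (extents nil))
    (while ring
      (let ((pos (marker-position (car ring))))
        (when pos
          ;; A zero-length extent carries the glyph without
          ;; covering any buffer text.
          (let ((e (make-extent pos pos)))
            (set-extent-begin-glyph e caret)
            (setq extents (cons e extents)))))
      (setq ring (cdr ring)))
    extents))
@end example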
16795
16796 @item Write a new garbage collector.
16797
16798 The emacs GC is very primitive; it is also, fortunately, a
16799 rather well isolated module, and it would not be a very big task to swap
16800 it with a new one (once that new one was written, that is.) Someone
16801 should go bone up on modern GC techniques, and then just dive right
16802 in...
16803 @item Add support for lexical scope to the emacs-lisp runtime.
16804
16805 Yadda yadda, this list goes to eleven.
16806 @end table
16807
16808 @*
16809 Subject:
16810 @strong{Re: XEmacs wishlist}
16811 Date: Wed, 14 May 1997 16:18:23 -0700
16812 From: Jamie Zawinski <jwz@@netscape.com>
16813 Newsgroups: comp.emacs.xemacs, comp.emacs
16814
16815 Andreas Schwab wrote:
16816
16817 @quotation
16818 @emph{Use `C-u C-x (': }
16819
16820 @emph{start-kbd-macro:@*Non-nil arg (prefix arg) means append to last
16821 macro defined; This begins by re-executing that macro as if you typed it
16822 again. }
16823 @end quotation
16824
16825 Cool, I didn't know it did that...
16826
16827 But it only lets you append. I often want to prepend, or embed the
16828 macro multiple times (motion 1, C-x e, motion 2, C-x e, motion 3.)
16829
16830 @subheading 21.2 Showstoppers
16831
16832 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16833
16834 I. DISTRIBUTION ISSUES
16835
16836 A. Unified Source Tarball.
16837
16838 Packages go under root/lib/xemacs/xemacs-packages and no one ever has
16839 to mess with --package-path and the result can be moved from one
16840 directory to another pre- or post-install.
16841
16842
16843 B. Unified Binary Tarballs with Packages.
16844
16845 Same principles as above.
16846
16847 If people complain, we can also provide split binary tarballs
16848 (architecture dependent and independent) and place these files in a
16849 subdirectory so as not to confuse the majority just looking for one
16850 tarball.
16851
16852 Under Windows, we need to provide a WISE-style GUI setup program. It's
16853 already there but needs some work so you can select "all" packages
16854 easily (should be the default).
16855
16856 C. Parallel Root and Package Trees.
16857
16858 If the user downloads the main source and the packages separately, he
16859 will naturally untar them into the same directory. This results in the
16860 parallel root and package structure. We should support this as a "last
16861 resort," i.e., if we find no packages anywhere and are about to resign
16862 ourselves to not having packages, then look for a parallel package
16863 tree. The user who sets things up like this should be able to either
16864 run in place or "make install" and get a proper installed
16865 XEmacs. Never should the user have to touch --package-path.
16866
16867 II. WINDOWS PRINTING
16868
16869 Looks like the internals are done but not the GUI. This must be
16870 working in 21.2.
16871
16872 III. WINDOWS MULE
16873
16874 Basic support should be there. There's already a patch to get things
16875 started and I'll be doing more work to make this real.
16876
16877 IV. GUTTER ETC.
16878
16879 This stuff needs to be "stable" and generally free from bugs. Any
16880 API's we create need to be well-reviewed or marked clearly as
16881 experimental.
16882
16883 V. PORTABLE DUMPER
16884
16885 Last bits need to be cleaned up. This should be made the "default" for
16886 a while to flush out problems. Under Microsoft Windows, Portable
16887 Dumper must be the default in 21.2 because of the problems with the
16888 existing dump process.
16889
16890 COMMENT: I'd like to feature freeze this pretty soon and create a 21.3
16891 tree where all of my major overhauls of Mule-related stuff will go
16892 in. At or around the same time, we need to do the move-around in the
16893 repository (or create a new one) and "upgrade" to the latest CVS
16894 server.
16895
16896 @node Future Work -- Elisp Compatibility Package, Future Work -- Drag-n-Drop, Future Work -- General Suggestions, Future Work
15187 @section Future Work -- Elisp Compatibility Package 16897 @section Future Work -- Elisp Compatibility Package
15188 @cindex future work, elisp compatibility package 16898 @cindex future work, elisp compatibility package
15189 @cindex elisp compatibility package, future work 16899 @cindex elisp compatibility package, future work
16900
16901 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
15190 16902
15191 A while ago I created a package called Sysdep, which aimed to be a 16903 A while ago I created a package called Sysdep, which aimed to be a
15192 forward compatibility package for Elisp. The idea was that instead of 16904 forward compatibility package for Elisp. The idea was that instead of
15193 having to write your package using the oldest version of Emacs that you 16905 having to write your package using the oldest version of Emacs that you
15194 wanted to support, you could use the newest XEmacs API, and then simply 16906 wanted to support, you could use the newest XEmacs API, and then simply
15320 where a function is called using @code{funcall} or @code{apply}. 17032 where a function is called using @code{funcall} or @code{apply}.
15321 However, such uses of functions would not be affected by the surrounding 17033 However, such uses of functions would not be affected by the surrounding
15322 macrolet call, and so there doesn't appear to be any point in extracting 17034 macrolet call, and so there doesn't appear to be any point in extracting
15323 them). 17035 them).
15324 17036
15325 @uref{../../www.666.com/ben/default.htm,Ben Wing}
15326
15327 @node Future Work -- Drag-n-Drop, Future Work -- Standard Interface for Enabling Extensions, Future Work -- Elisp Compatibility Package, Future Work 17037 @node Future Work -- Drag-n-Drop, Future Work -- Standard Interface for Enabling Extensions, Future Work -- Elisp Compatibility Package, Future Work
15328 @section Future Work -- Drag-n-Drop 17038 @section Future Work -- Drag-n-Drop
15329 @cindex future work, drag-n-drop 17039 @cindex future work, drag-n-drop
15330 @cindex drag-n-drop, future work 17040 @cindex drag-n-drop, future work
17041
17042 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
15331 17043
15332 @strong{Abstract:} I propose completely redoing the drag-n-drop 17044 @strong{Abstract:} I propose completely redoing the drag-n-drop
15333 interface to make it powerful and extensible enough to support such 17045 interface to make it powerful and extensible enough to support such
15334 concepts as drag over and drag under visuals and context menus invoked 17046 concepts as drag over and drag under visuals and context menus invoked
15335 when a drag is done with the right mouse button, to allow drop handlers 17047 when a drag is done with the right mouse button, to allow drop handlers
15437 drop, etc. This event is always passed to any function that is invoked 17149 drop, etc. This event is always passed to any function that is invoked
15438 as a result of the drag or drop. There should never be any need to 17150 as a result of the drag or drop. There should never be any need to
15439 refer to the @code{current-mouse-event} variable, and in fact, this 17151 refer to the @code{current-mouse-event} variable, and in fact, this
15440 variable should not be changed at all during a drag or a drop. 17152 variable should not be changed at all during a drag or a drop.
15441 17153
15442 @uref{../../www.666.com/ben/default.htm,Ben Wing}
15443
15444 @node Future Work -- Standard Interface for Enabling Extensions, Future Work -- Better Initialization File Scheme, Future Work -- Drag-n-Drop, Future Work 17154 @node Future Work -- Standard Interface for Enabling Extensions, Future Work -- Better Initialization File Scheme, Future Work -- Drag-n-Drop, Future Work
15445 @section Future Work -- Standard Interface for Enabling Extensions 17155 @section Future Work -- Standard Interface for Enabling Extensions
15446 @cindex future work, standard interface for enabling extensions 17156 @cindex future work, standard interface for enabling extensions
15447 @cindex standard interface for enabling extensions, future work 17157 @cindex standard interface for enabling extensions, future work
17158
17159 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
15448 17160
15449 @strong{Abstract:} Apparently, if you know the name of a package (for 17161 @strong{Abstract:} Apparently, if you know the name of a package (for
15450 example, @code{fusion}), you can load it using the @code{require} 17162 example, @code{fusion}), you can load it using the @code{require}
15451 function, but there's no standard way to turn it on or turn it off. The 17163 function, but there's no standard way to turn it on or turn it off. The
15452 only way to figure out how to do that is to go read the source file, 17164 only way to figure out how to do that is to go read the source file,
15518 extensions and a judgment on first of all, how commonly a user might 17230 extensions and a judgment on first of all, how commonly a user might
15519 want this extension, and second of all, how well written and bug-free 17231 want this extension, and second of all, how well written and bug-free
15520 the package is. Both of these sorts of judgments could be obtained by 17232 the package is. Both of these sorts of judgments could be obtained by
15521 doing user surveys if need be. 17233 doing user surveys if need be.
15522 17234
15523 @uref{../../www.666.com/ben/default.htm,Ben Wing}
15524
15525 @node Future Work -- Better Initialization File Scheme, Future Work -- Keyword Parameters, Future Work -- Standard Interface for Enabling Extensions, Future Work 17235 @node Future Work -- Better Initialization File Scheme, Future Work -- Keyword Parameters, Future Work -- Standard Interface for Enabling Extensions, Future Work
15526 @section Future Work -- Better Initialization File Scheme 17236 @section Future Work -- Better Initialization File Scheme
15527 @cindex future work, better initialization file scheme 17237 @cindex future work, better initialization file scheme
15528 @cindex better initialization file scheme, future work 17238 @cindex better initialization file scheme, future work
17239
17240 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
15529 17241
15530 @strong{Abstract:} A proposal is outlined for converting XEmacs to use 17242 @strong{Abstract:} A proposal is outlined for converting XEmacs to use
15531 the @code{.xemacs} subdirectory for its initialization files instead of 17243 the @code{.xemacs} subdirectory for its initialization files instead of
15532 putting them in the user's home directory. In the process, a general 17244 putting them in the user's home directory. In the process, a general
15533 pre-initialization scheme is created whereby all of the initialization 17245 pre-initialization scheme is created whereby all of the initialization
15625 @code{init.el} or @code{pre-init.el}, or if neither of those files is 17337 @code{init.el} or @code{pre-init.el}, or if neither of those files is
15626 present, then it doesn't contain any sub-directories or files that look 17338 present, then it doesn't contain any sub-directories or files that look
15627 like what would be in a package root), then it becomes the value of the 17339 like what would be in a package root), then it becomes the value of the
15628 init file directory. Otherwise the user's home directory is used. 17340 init file directory. Otherwise the user's home directory is used.
15629 @item 17341 @item
15630
15631 17342
15632 If the init file directory is the user's home directory, then the init 17343 If the init file directory is the user's home directory, then the init
15633 file is called @code{.emacs}. Otherwise, it's called @code{init.el}. 17344 file is called @code{.emacs}. Otherwise, it's called @code{init.el}.
15634 @item 17345 @item
15635
15636 17346
15637 If the init file directory is the user's home directory, then the 17347 If the init file directory is the user's home directory, then the
15638 pre-init file is called @code{.xemacs-pre-init.el}. Otherwise it's 17348 pre-init file is called @code{.xemacs-pre-init.el}. Otherwise it's
15639 called @code{pre-init.el}. (One of the reasons for this rule has to do 17349 called @code{pre-init.el}. (One of the reasons for this rule has to do
15640 with the dialog box that might be displayed at startup. This will be 17350 with the dialog box that might be displayed at startup. This will be
15641 described below.) 17351 described below.)
15642 @item 17352 @item
15643
15644 17353
15645 If the init file directory is the user's home directory, then the custom 17354 If the init file directory is the user's home directory, then the custom
15646 init file is called @code{.xemacs-custom-init.el}. Otherwise, it's 17355 init file is called @code{.xemacs-custom-init.el}. Otherwise, it's
15647 called @code{custom-init.el}. 17356 called @code{custom-init.el}.
15648 17357
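
The three naming rules above can be summarized in a short sketch. The
function name @code{my-init-file-names} and its two arguments are
hypothetical; only the file names themselves come from the rules.

@example
(defun my-init-file-names (init-dir home-dir)
  "Return the init, pre-init and custom init file names, in that order.
INIT-DIR is the init file directory and HOME-DIR the home directory."
  (let ((in-home (string= (file-name-as-directory
                           (expand-file-name init-dir))
                          (file-name-as-directory
                           (expand-file-name home-dir)))))
    (mapcar (lambda (name) (expand-file-name name init-dir))
            (if in-home
                '(".emacs"
                  ".xemacs-pre-init.el"
                  ".xemacs-custom-init.el")
              '("init.el" "pre-init.el" "custom-init.el")))))
@end example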
15712 17421
15713 If an error occurs in the init file, then the initial frame should 17422 If an error occurs in the init file, then the initial frame should
15714 always be created and mapped at that time so that the error is displayed 17423 always be created and mapped at that time so that the error is displayed
15715 and the debugger has a place to be invoked. 17424 and the debugger has a place to be invoked.
15716 17425
15717 @uref{../../www.666.com/ben/default.htm,Ben Wing}
15718
15719 @node Future Work -- Keyword Parameters, Future Work -- Property Interface Changes, Future Work -- Better Initialization File Scheme, Future Work 17426 @node Future Work -- Keyword Parameters, Future Work -- Property Interface Changes, Future Work -- Better Initialization File Scheme, Future Work
15720 @section Future Work -- Keyword Parameters 17427 @section Future Work -- Keyword Parameters
15721 @cindex future work, keyword parameters 17428 @cindex future work, keyword parameters
15722 @cindex keyword parameters, future work 17429 @cindex keyword parameters, future work
17430
17431 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
15723 17432
15724 NOTE: These changes are partly motivated by the various user-interface 17433 NOTE: These changes are partly motivated by the various user-interface
15725 changes elsewhere in this document, and partly for Mule support. In 17434 changes elsewhere in this document, and partly for Mule support. In
15726 general the various API's in this document would benefit greatly from 17435 general the various API's in this document would benefit greatly from
15727 built-in keywords. 17436 built-in keywords.
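
For readers who have not used keyword parameters, here is a minimal
sketch of how they look at the Lisp level today, assuming the
@code{defun*} macro from the @code{cl} package (GNU Emacs spells it
@code{cl-defun} in @code{cl-lib}). The function @code{my-make-button}
and its arguments are hypothetical.

@example
(require 'cl)

(defun* my-make-button (label &key (width 10) (enabled t))
  "Return a plist describing a hypothetical button."
  (list 'label label 'width width 'enabled enabled))

;; Keywords may be given in any order, and omitted ones take
;; their defaults.
(my-make-button "OK" :width 20)
;; returns (label "OK" width 20 enabled t)
@end example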
15770 @item 17479 @item
15771 17480
15772 The subr object type needs to be modified to contain additional slots 17481 The subr object type needs to be modified to contain additional slots
15773 for the number and names of any keyword parameters. 17482 for the number and names of any keyword parameters.
15774 @item 17483 @item
15775
15776 17484
15777 The implementation of the @code{funcall} function needs to be modified 17485 The implementation of the @code{funcall} function needs to be modified
15778 so that it knows how to process keyword parameters. This is the only 17486 so that it knows how to process keyword parameters. This is the only
15779 place that will require very much intricate coding, and much of the 17487 place that will require very much intricate coding, and much of the
15780 logic that would need to be added can be lifted directly from the 17488 logic that would need to be added can be lifted directly from the
15781 @code{cl} code. 17489 @code{cl} code.
15782 @item 17490 @item
15783
15784 17491
15785 A new macro, similar to the @code{DEFUN} macro, and probably called 17492 A new macro, similar to the @code{DEFUN} macro, and probably called
15786 @code{DEFUN_WITH_KEYWORDS}, needs to be defined so that built-in Lisp 17493 @code{DEFUN_WITH_KEYWORDS}, needs to be defined so that built-in Lisp
15787 primitives containing keywords can be created. Now, the 17494 primitives containing keywords can be created. Now, the
15788 @code{DEFUN_WITH_KEYWORDS} macro should take an additional parameter 17495 @code{DEFUN_WITH_KEYWORDS} macro should take an additional parameter
15802 that specifies the number of keyword parameters. However, this would 17509 that specifies the number of keyword parameters. However, this would
15803 require some additional complexity in the preprocessor definition of the 17510 require some additional complexity in the preprocessor definition of the
15804 @code{DEFUN_WITH_KEYWORDS} macro, and probably isn't worth 17511 @code{DEFUN_WITH_KEYWORDS} macro, and probably isn't worth
15805 implementing). 17512 implementing).
15806 @item 17513 @item
15807
15808 17514
15809 The byte compiler would have to be modified slightly so that it knows 17515 The byte compiler would have to be modified slightly so that it knows
15810 about keyword parameters when it parses the parameter declaration of a 17516 about keyword parameters when it parses the parameter declaration of a
15811 function. For example, so that it issues the correct warnings 17517 function. For example, so that it issues the correct warnings
15812 concerning calls to that function with incorrect arguments. 17518 concerning calls to that function with incorrect arguments.
15813 @item 17519 @item
15814
15815 17520
15816 The @code{make-docfile} program would have to be modified so that it 17521 The @code{make-docfile} program would have to be modified so that it
15817 generates the correct parameter lists for primitives defined using the 17522 generates the correct parameter lists for primitives defined using the
15818 @code{DEFUN_WITH_KEYWORDS} macro. 17523 @code{DEFUN_WITH_KEYWORDS} macro.
15819 @item 17524 @item
15820
15821 17525
15822 Possibly other aspects of the help system that deal with function 17526 Possibly other aspects of the help system that deal with function
15823 descriptions might have to be modified. 17527 descriptions might have to be modified.
15824 @item 17528 @item
15825
15826 17529
15827 A helper function might need to be defined to make it easier for 17530 A helper function might need to be defined to make it easier for
15828 primitives that use both the @code{&rest} and @code{&key} 17531 primitives that use both the @code{&rest} and @code{&key}
15829 specifiers to parse their argument lists. 17532 specifiers to parse their argument lists.
15830 17533
15890 @node Future Work -- Property Interface Changes, Future Work -- Toolbars, Future Work -- Keyword Parameters, Future Work 17593 @node Future Work -- Property Interface Changes, Future Work -- Toolbars, Future Work -- Keyword Parameters, Future Work
15891 @section Future Work -- Property Interface Changes 17594 @section Future Work -- Property Interface Changes
15892 @cindex future work, property interface changes 17595 @cindex future work, property interface changes
15893 @cindex property interface changes, future work 17596 @cindex property interface changes, future work
15894 17597
17598 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17599
15895 In my past work on XEmacs, I already expanded the standard property 17600 In my past work on XEmacs, I already expanded the standard property
15896 functions of @code{get}, @code{put}, and @code{remprop} to work on 17601 functions of @code{get}, @code{put}, and @code{remprop} to work on
15897 objects other than symbols and defined an additional function 17602 objects other than symbols and defined an additional function
15898 @code{object-plist} for this interface. I'd like to expand this 17603 @code{object-plist} for this interface. I'd like to expand this
15899 interface further and advertise it as the standard way to make property 17604 interface further and advertise it as the standard way to make property
15911 @dfn{unbound}, which is to say that its value has not been explicitly 17616 @dfn{unbound}, which is to say that its value has not been explicitly
15912 specified. Note: the way to make a property unbound is to call 17617 specified. Note: the way to make a property unbound is to call
15913 @code{remprop}. Note also that for some built-in properties, setting 17618 @code{remprop}. Note also that for some built-in properties, setting
15914 the property to its default value is equivalent to making it unbound. 17619 the property to its default value is equivalent to making it unbound.
15915 @item 17620 @item
15916
15917 17621
15918 The behavior of the @code{get} function is modified. If the @code{get} 17622 The behavior of the @code{get} function is modified. If the @code{get}
15919 function is called on a property that is unbound and the third, optional 17623 function is called on a property that is unbound and the third, optional
15920 @var{default} argument is @code{nil}, then the default value of the 17624 @var{default} argument is @code{nil}, then the default value of the
15921 property is returned. If the @var{default} argument is not @code{nil}, 17625 property is returned. If the @var{default} argument is not @code{nil},
15925 initial default value of @code{nil}. Code that calls the @code{get} 17629 initial default value of @code{nil}. Code that calls the @code{get}
15926 function and specifies @code{nil} for the @var{default} argument, and 17630 function and specifies @code{nil} for the @var{default} argument, and
15927 expects to get @code{nil} returned if the property is unbound, is almost 17631 expects to get @code{nil} returned if the property is unbound, is almost
15928 certainly wrong anyway. 17632 certainly wrong anyway.
15929 @item 17633 @item
15930
15931 17634
15932 A new function, @code{get1}, is defined. This function does not take a 17635 A new function, @code{get1}, is defined. This function does not take a
15933 default argument like the @code{get} function. Instead, if the property 17636 default argument like the @code{get} function. Instead, if the property
15934 is unbound, an error is signaled. Note: @code{get} can be implemented 17637 is unbound, an error is signaled. Note: @code{get} can be implemented
15935 in terms of @code{get1}. 17638 in terms of @code{get1}.
15936 @item 17639 @item
15937
15938 17640
15939 New functions @code{property-default-value} and @code{property-bound-p} 17641 New functions @code{property-default-value} and @code{property-bound-p}
15940 are defined with the obvious semantics. 17642 are defined with the obvious semantics.
15941 @item 17643 @item
15942
15943 17644
15944 An additional function @code{property-built-in-p} is defined which takes 17645 An additional function @code{property-built-in-p} is defined which takes
15945 two arguments, the first one being a symbol naming an object type, and 17646 two arguments, the first one being a symbol naming an object type, and
15946 the second one specifying a property, and indicates whether the property 17647 the second one specifying a property, and indicates whether the property
15947 name has a built-in meaning for objects of that type. 17648 name has a built-in meaning for objects of that type.
15948 @item 17649 @item
15949
15950 17650
15951 It is not necessary, or even desirable, for all object types to allow 17651 It is not necessary, or even desirable, for all object types to allow
15952 user-defined properties. It is always possible to simulate user-defined 17652 user-defined properties. It is always possible to simulate user-defined
15953 properties for an object by using a weak hash table. Therefore, whether 17653 properties for an object by using a weak hash table. Therefore, whether
15954 an object allows a user to define properties or not should depend on the 17654 an object allows a user to define properties or not should depend on the
15955 meaning of the object. If an object does not allow user-defined 17655 meaning of the object. If an object does not allow user-defined
15956 properties, the @code{put} function should signal an error, such as 17656 properties, the @code{put} function should signal an error, such as
15957 @code{undefined-property}, when given any property other than those that 17657 @code{undefined-property}, when given any property other than those that
15958 are predefined. 17658 are predefined.
15959 @item 17659 @item
15960
15961 17660
15962 A function called @code{user-defined-properties-allowed-p} should be 17661 A function called @code{user-defined-properties-allowed-p} should be
15963 defined with the obvious semantics. (See the previous item.) 17662 defined with the obvious semantics. (See the previous item.)
15964 @item 17663 @item
15965
15966 17664
15967 Three more functions should be defined, called 17665 Three more functions should be defined, called
15968 @code{built-in-property-name-list}, @code{property-name-list}, and 17666 @code{built-in-property-name-list}, @code{property-name-list}, and
15969 @code{user-defined-property-name-list}. 17667 @code{user-defined-property-name-list}.
15970 17668
15986 17684
15987 e.g. (define-property-method 'hash-table 17685 e.g. (define-property-method 'hash-table
15988 :put #'(lambda (obj key value) (puthash key value obj))) 17686 :put #'(lambda (obj key value) (puthash key value obj)))
15989 @end example 17687 @end example
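
The semantics proposed above for @code{get}, @code{get1}, @code{remprop}, @code{property-bound-p}, and @code{property-default-value} could be modeled roughly as follows. This is only a sketch under stated assumptions, not XEmacs code: every @code{sketch-} name is hypothetical, an object is modeled as a cons of two plists, and I assume that a non-@code{nil} @var{default} argument to @code{get} is returned as-is when the property is unbound.

```elisp
;; Hypothetical model, not the XEmacs implementation: an "object" is a
;; cons cell (BOUND-PLIST . DEFAULTS-PLIST).

(defun sketch-make-object (defaults)
  "Return a model object with no bound properties and DEFAULTS as defaults."
  (cons nil defaults))

(defun sketch-property-bound-p (obj prop)
  "Return t if PROP has been explicitly set on OBJ, else nil."
  (and (plist-member (car obj) prop) t))

(defun sketch-property-default-value (obj prop)
  "Return the default value of PROP for OBJ (nil if none is defined)."
  (plist-get (cdr obj) prop))

(defun sketch-put (obj prop value)
  "Bind PROP to VALUE on OBJ and return VALUE."
  (setcar obj (plist-put (car obj) prop value))
  value)

(defun sketch-remprop (obj prop)
  "Make PROP unbound on OBJ by rebuilding the bound plist without it."
  (let ((plist (car obj))
        (result nil))
    (while plist
      (unless (eq (car plist) prop)
        (setq result (cons (cadr plist) (cons (car plist) result))))
      (setq plist (cddr plist)))
    (setcar obj (nreverse result))))

(defun sketch-get1 (obj prop)
  "Return the value of PROP on OBJ; signal an error if PROP is unbound."
  (if (sketch-property-bound-p obj prop)
      (plist-get (car obj) prop)
    (error "Property %s is unbound" prop)))

(defun sketch-get (obj prop &optional default)
  "Like `sketch-get1', but fall back to DEFAULT or the property default.
Assumption: a non-nil DEFAULT is returned unchanged for an unbound PROP."
  (cond ((sketch-property-bound-p obj prop) (sketch-get1 obj prop))
        (default default)
        (t (sketch-property-default-value obj prop))))

;; Usage: `width' has a default of 80 and starts out unbound.
(let ((obj (sketch-make-object '(width 80))))
  (sketch-get obj 'width)               ; unbound: the default value, 80
  (sketch-put obj 'width 100)
  (sketch-get obj 'width)               ; bound: 100
  (sketch-remprop obj 'width)
  (sketch-property-bound-p obj 'width)) ; unbound again: nil
```

In this model, setting a property to its default value still leaves it bound; the note above that the two are equivalent for some built-in properties would need per-property handling that the sketch omits.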
15990 17688
15991
15992 @node Future Work -- Toolbars, Future Work -- Menu API Changes, Future Work -- Property Interface Changes, Future Work 17689 @node Future Work -- Toolbars, Future Work -- Menu API Changes, Future Work -- Property Interface Changes, Future Work
15993 @section Future Work -- Toolbars 17690 @section Future Work -- Toolbars
15994 @cindex future work, toolbars 17691 @cindex future work, toolbars
15995 @cindex toolbars 17692 @cindex toolbars
15996 17693
16001 17698
16002 @node Future Work -- Easier Toolbar Customization, Future Work -- Toolbar Interface Changes, Future Work -- Toolbars, Future Work -- Toolbars 17699 @node Future Work -- Easier Toolbar Customization, Future Work -- Toolbar Interface Changes, Future Work -- Toolbars, Future Work -- Toolbars
16003 @subsection Future Work -- Easier Toolbar Customization 17700 @subsection Future Work -- Easier Toolbar Customization
16004 @cindex future work, easier toolbar customization 17701 @cindex future work, easier toolbar customization
16005 @cindex easier toolbar customization, future work 17702 @cindex easier toolbar customization, future work
17703
17704 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16006 17705
16007 @strong{Abstract:} One of XEmacs' greatest strengths is its ability to 17706 @strong{Abstract:} One of XEmacs' greatest strengths is its ability to
16008 be customized endlessly. Unfortunately, it is often too difficult to 17707 be customized endlessly. Unfortunately, it is often too difficult to
16009 figure out how to do this. There has been some recent work like the 17708 figure out how to do this. There has been some recent work like the
16010 Custom package, which helps in this regard, but I think there's a lot 17709 Custom package, which helps in this regard, but I think there's a lot
16053 ones, would be the ability to change the font size of the captions. I'm 17752 ones, would be the ability to change the font size of the captions. I'm
16054 sure that Kyle, for one, would appreciate this. 17753 sure that Kyle, for one, would appreciate this.
16055 17754
16056 (This is incomplete.....) 17755 (This is incomplete.....)
16057 17756
16058 @uref{../../www.666.com/ben/default.htm,Ben Wing}
16059
16060 @node Future Work -- Toolbar Interface Changes, , Future Work -- Easier Toolbar Customization, Future Work -- Toolbars 17757 @node Future Work -- Toolbar Interface Changes, , Future Work -- Easier Toolbar Customization, Future Work -- Toolbars
16061 @subsection Future Work -- Toolbar Interface Changes 17758 @subsection Future Work -- Toolbar Interface Changes
16062 @cindex future work, toolbar interface changes 17759 @cindex future work, toolbar interface changes
16063 @cindex toolbar interface changes, future work 17760 @cindex toolbar interface changes, future work
17761
17762 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16064 17763
16065 I propose changing the way that toolbars are specified to make them more 17764 I propose changing the way that toolbars are specified to make them more
16066 flexible. 17765 flexible.
16067 17766
16068 @enumerate 17767 @enumerate
16205 @node Future Work -- Menu API Changes, Future Work -- Removal of Misc-User Event Type, Future Work -- Toolbars, Future Work 17904 @node Future Work -- Menu API Changes, Future Work -- Removal of Misc-User Event Type, Future Work -- Toolbars, Future Work
16206 @section Future Work -- Menu API Changes 17905 @section Future Work -- Menu API Changes
16207 @cindex future work, menu API changes 17906 @cindex future work, menu API changes
16208 @cindex menu API changes, future work 17907 @cindex menu API changes, future work
16209 17908
17909 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16210 17910
16211 @enumerate 17911 @enumerate
16212 @item 17912 @item
16213 17913
16214 I propose making a specifier for the menubar associated with the frame. 17914 I propose making a specifier for the menubar associated with the frame.
16258 properties may not actually be implemented at first, but at least the 17958 properties may not actually be implemented at first, but at least the
16259 keywords for them should be defined. 17959 keywords for them should be defined.
16260 17960
16261 @end enumerate 17961 @end enumerate
16262 17962
16263 @uref{../../www.666.com/ben/default.htm,Ben Wing}
16264
16265 @node Future Work -- Removal of Misc-User Event Type, Future Work -- Mouse Pointer, Future Work -- Menu API Changes, Future Work 17963 @node Future Work -- Removal of Misc-User Event Type, Future Work -- Mouse Pointer, Future Work -- Menu API Changes, Future Work
16266 @section Future Work -- Removal of Misc-User Event Type 17964 @section Future Work -- Removal of Misc-User Event Type
16267 @cindex future work, removal of misc-user event type 17965 @cindex future work, removal of misc-user event type
16268 @cindex removal of misc-user event type, future work 17966 @cindex removal of misc-user event type, future work
17967
17968 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16269 17969
16270 @strong{Abstract:} This page describes why the misc-user event type 17970 @strong{Abstract:} This page describes why the misc-user event type
16271 should be split up into a number of different event types, and how to do 17971 should be split up into a number of different event types, and how to do
16272 this. 17972 this.
16273 17973
16312 @node Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Busy Pointer, Future Work -- Mouse Pointer, Future Work -- Mouse Pointer 18012 @node Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Busy Pointer, Future Work -- Mouse Pointer, Future Work -- Mouse Pointer
16313 @subsection Future Work -- Abstracted Mouse Pointer Interface 18013 @subsection Future Work -- Abstracted Mouse Pointer Interface
16314 @cindex future work, abstracted mouse pointer interface 18014 @cindex future work, abstracted mouse pointer interface
16315 @cindex abstracted mouse pointer interface, future work 18015 @cindex abstracted mouse pointer interface, future work
16316 18016
18017 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18018
16317 @strong{Abstract:} We need to create a new image format that allows 18019 @strong{Abstract:} We need to create a new image format that allows
16318 standard pointer shapes to be specified in a way that works on all 18020 standard pointer shapes to be specified in a way that works on all
16319 Windows systems. I suggest that this be called @code{pointer}, which 18021 Windows systems. I suggest that this be called @code{pointer}, which
16320 has one tag associated with it, named @code{:data}, and whose value is a 18022 has one tag associated with it, named @code{:data}, and whose value is a
16321 string. The possible strings that can be specified here are predefined 18023 string. The possible strings that can be specified here are predefined
16336 be @code{mswindows-resource}. At least in the case of 18038 be @code{mswindows-resource}. At least in the case of
16337 @code{cursor-font}, the old value should be maintained for compatibility 18039 @code{cursor-font}, the old value should be maintained for compatibility
16338 as an obsolete alias. The @code{resource} format was added so recently 18040 as an obsolete alias. The @code{resource} format was added so recently
16339 that it's possible that we can just change it. 18041 that it's possible that we can just change it.
16340 18042
16341 @uref{../../www.666.com/ben/default.htm,Ben Wing}
16342
16343 @node Future Work -- Busy Pointer, , Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Mouse Pointer 18043 @node Future Work -- Busy Pointer, , Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Mouse Pointer
16344 @subsection Future Work -- Busy Pointer 18044 @subsection Future Work -- Busy Pointer
16345 @cindex future work, busy pointer 18045 @cindex future work, busy pointer
16346 @cindex busy pointer, future work 18046 @cindex busy pointer, future work
18047
18048 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16347 18049
16348 Automatically make the mouse pointer switch to a busy shape (watch 18050 Automatically make the mouse pointer switch to a busy shape (watch
16349 signal) when XEmacs has been "busy" for more than, e.g. 2 seconds. 18051 signal) when XEmacs has been "busy" for more than, e.g. 2 seconds.
16350 Define the @dfn{busy time} as the time since the last time that XEmacs was 18052 Define the @dfn{busy time} as the time since the last time that XEmacs was
16351 ready to receive input from the user. An implementation might be: 18053 ready to receive input from the user. An implementation might be:
16393 @node Future Work -- Everything should obey duplicable extents, , Future Work -- Extents, Future Work -- Extents 18095 @node Future Work -- Everything should obey duplicable extents, , Future Work -- Extents, Future Work -- Extents
16394 @subsection Future Work -- Everything should obey duplicable extents 18096 @subsection Future Work -- Everything should obey duplicable extents
16395 @cindex future work, everything should obey duplicable extents 18097 @cindex future work, everything should obey duplicable extents
16396 @cindex everything should obey duplicable extents, future work 18098 @cindex everything should obey duplicable extents, future work
16397 18099
18100 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18101
16398 A lot of functions don't properly track duplicable extents. For 18102 A lot of functions don't properly track duplicable extents. For
16399 example, the @code{concat} function does, but the @code{format} function 18103 example, the @code{concat} function does, but the @code{format} function
16400 does not, and extents in keymap prompts are not displayed either. All 18104 does not, and extents in keymap prompts are not displayed either. All
16401 of the functions that generate strings or string-like entities should 18105 of the functions that generate strings or string-like entities should
16402 track the extents that are associated with the strings. Currently this 18106 track the extents that are associated with the strings. Currently this
16423 a Lisp string into a @code{lisp_string_struct}. However, there is 18127 a Lisp string into a @code{lisp_string_struct}. However, there is
16424 already a function @code{copy_string_extents()} that does basically this 18128 already a function @code{copy_string_extents()} that does basically this
16425 exact thing, and it should be easy to create a modified version of this 18129 exact thing, and it should be easy to create a modified version of this
16426 function. 18130 function.
16427 18131
16428 @uref{../../www.666.com/ben/default.htm,Ben Wing}
16429
16430 @node Future Work -- Version Number and Development Tree Organization, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Extents, Future Work 18132 @node Future Work -- Version Number and Development Tree Organization, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Extents, Future Work
16431 @section Future Work -- Version Number and Development Tree Organization 18133 @section Future Work -- Version Number and Development Tree Organization
16432 @cindex future work, version number and development tree organization 18134 @cindex future work, version number and development tree organization
16433 @cindex version number and development tree organization, future work 18135 @cindex version number and development tree organization, future work
18136
18137 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16434 18138
16435 @strong{Abstract:} The purpose of this proposal is to present a coherent 18139 @strong{Abstract:} The purpose of this proposal is to present a coherent
16436 plan for how development branches in XEmacs are managed. This will 18140 plan for how development branches in XEmacs are managed. This will
16437 cover such issues as stable versus experimental branches, creating new 18141 cover such issues as stable versus experimental branches, creating new
16438 branches, synchronizing patches between branches, and how version 18142 branches, synchronizing patches between branches, and how version
16724 without the diff getting cluttered up by these code cleanliness changes 18428 without the diff getting cluttered up by these code cleanliness changes
16725 that don't change any actual behavior. 18429 that don't change any actual behavior.
16726 18430
16727 @end enumerate 18431 @end enumerate
16728 18432
16729 @uref{../../www.666.com/ben,Ben Wing}
16730
16731 @node Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Keybindings, Future Work -- Version Number and Development Tree Organization, Future Work 18433 @node Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Keybindings, Future Work -- Version Number and Development Tree Organization, Future Work
16732 @section Future Work -- Improvements to the @code{xemacs.org} Website 18434 @section Future Work -- Improvements to the @code{xemacs.org} Website
16733 @cindex future work, improvements to the @code{xemacs.org} website 18435 @cindex future work, improvements to the @code{xemacs.org} website
16734 @cindex improvements to the @code{xemacs.org} website, future work 18436 @cindex improvements to the @code{xemacs.org} website, future work
18437
18438 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16735 18439
16736 The @code{xemacs.org} web site is the face that XEmacs presents to the 18440 The @code{xemacs.org} web site is the face that XEmacs presents to the
16737 outside world. In my opinion, its most important function is to present 18441 outside world. In my opinion, its most important function is to present
16738 information about XEmacs in such a way that solicits new XEmacs users 18442 information about XEmacs in such a way that solicits new XEmacs users
16739 and co-contributors. Existing members of the XEmacs community can 18443 and co-contributors. Existing members of the XEmacs community can
16829 at @uref{../../www.freshmeat.net/default.htm,http://www.freshmeat.net}, 18533 at @uref{../../www.freshmeat.net/default.htm,http://www.freshmeat.net},
16830 the various announcement news groups (for example, 18534 the various announcement news groups (for example,
16831 @uref{news:comp.os.linux.announce,comp.os.linux.announce}, and the 18535 @uref{news:comp.os.linux.announce,comp.os.linux.announce}, and the
16832 Windows announcement news group) etc. 18536 Windows announcement news group) etc.
16833 18537
16834 @uref{../../www.666.com/ben/default.htm,Ben Wing}
16835
16836 @node Future Work -- Keybindings, Future Work -- Byte Code Snippets, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work 18538 @node Future Work -- Keybindings, Future Work -- Byte Code Snippets, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work
16837 @section Future Work -- Keybindings 18539 @section Future Work -- Keybindings
16838 @cindex future work, keybindings 18540 @cindex future work, keybindings
16839 @cindex keybindings, future work 18541 @cindex keybindings, future work
16840 18542
16846 18548
16847 @node Future Work -- Keybinding Schemes, Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings, Future Work -- Keybindings 18549 @node Future Work -- Keybinding Schemes, Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings, Future Work -- Keybindings
16848 @subsection Future Work -- Keybinding Schemes 18550 @subsection Future Work -- Keybinding Schemes
16849 @cindex future work, keybinding schemes 18551 @cindex future work, keybinding schemes
16850 @cindex keybinding schemes, future work 18552 @cindex keybinding schemes, future work
18553
18554 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16851 18555
16852 @strong{Abstract:} We need a standard mechanism that allows different 18556 @strong{Abstract:} We need a standard mechanism that allows different
16853 global key binding schemes to be defined. Ideally, this would be the 18557 global key binding schemes to be defined. Ideally, this would be the
16854 @uref{keyboard-actions.html,keyboard action interface} that I have 18558 @uref{keyboard-actions.html,keyboard action interface} that I have
16855 proposed; however, this would require a lot of work on the part of mode 18559 proposed; however, this would require a lot of work on the part of mode
16864 18568
16865 @node Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Misc Key Binding Ideas, Future Work -- Keybinding Schemes, Future Work -- Keybindings 18569 @node Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Misc Key Binding Ideas, Future Work -- Keybinding Schemes, Future Work -- Keybindings
16866 @subsection Future Work -- Better Support for Windows Style Key Bindings 18570 @subsection Future Work -- Better Support for Windows Style Key Bindings
16867 @cindex future work, better support for windows style key bindings 18571 @cindex future work, better support for windows style key bindings
16868 @cindex better support for windows style key bindings, future work 18572 @cindex better support for windows style key bindings, future work
18573
18574 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16869 18575
16870 @strong{Abstract:} This page describes how we could create an XEmacs 18576 @strong{Abstract:} This page describes how we could create an XEmacs
16871 extension that modifies the global key bindings so that a Windows user 18577 extension that modifies the global key bindings so that a Windows user
16872 would feel at home when using the keyboard in XEmacs. Some of these 18578 would feel at home when using the keyboard in XEmacs. Some of these
16873 bindings don't conflict with standard XEmacs keybindings and should be 18579 bindings don't conflict with standard XEmacs keybindings and should be
16931 allows the user to make a selection of which key binding scheme they 18637 allows the user to make a selection of which key binding scheme they
16932 would prefer as the default, either the XEmacs standard bindings, Vi 18638 would prefer as the default, either the XEmacs standard bindings, Vi
16933 bindings (which would be Viper mode), Windows-style bindings, Brief, 18639 bindings (which would be Viper mode), Windows-style bindings, Brief,
16934 CodeWright, Visual C++, or whatever we manage to implement. 18640 CodeWright, Visual C++, or whatever we manage to implement.
16935 18641
16936 @uref{../../www.666.com/ben/default.htm,Ben Wing}
16937
16938 @node Future Work -- Misc Key Binding Ideas, , Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings 18642 @node Future Work -- Misc Key Binding Ideas, , Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings
16939 @subsection Future Work -- Misc Key Binding Ideas 18643 @subsection Future Work -- Misc Key Binding Ideas
16940 @cindex future work, misc key binding ideas 18644 @cindex future work, misc key binding ideas
16941 @cindex misc key binding ideas, future work 18645 @cindex misc key binding ideas, future work
18646
18647 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
16942 18648
16943 @itemize 18649 @itemize
16944 @item 18650 @item
16945 M-123 ... do digit arg 18651 M-123 ... do digit arg
16946 18652
16998 @node Future Work -- Byte Code Snippets, Future Work -- Lisp Stream API, Future Work -- Keybindings, Future Work 18704 @node Future Work -- Byte Code Snippets, Future Work -- Lisp Stream API, Future Work -- Keybindings, Future Work
16999 @section Future Work -- Byte Code Snippets 18705 @section Future Work -- Byte Code Snippets
17000 @cindex future work, byte code snippets 18706 @cindex future work, byte code snippets
17001 @cindex byte code snippets, future work 18707 @cindex byte code snippets, future work
17002 18708
18709 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18710
17003 @itemize 18711 @itemize
17004 @item 18712 @item
17005 For use in time critical (e.g. redisplay) places such as display 18713 For use in time critical (e.g. redisplay) places such as display
17006 tables - a simple piece of code is evalled, e.g. 18714 tables - a simple piece of code is evalled, e.g.
17007 @example 18715 @example
17027 @end itemize 18735 @end itemize
17028 18736
17029 @menu 18737 @menu
17030 * Future Work -- Autodetection:: 18738 * Future Work -- Autodetection::
17031 * Future Work -- Conversion Error Detection:: 18739 * Future Work -- Conversion Error Detection::
18740 * Future Work -- Unicode::
17032 * Future Work -- BIDI Support:: 18741 * Future Work -- BIDI Support::
17033 * Future Work -- Localized Text/Messages:: 18742 * Future Work -- Localized Text/Messages::
17034 @end menu 18743 @end menu
17035 18744
17036 @node Future Work -- Autodetection, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets, Future Work -- Byte Code Snippets 18745 @node Future Work -- Autodetection, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets, Future Work -- Byte Code Snippets
17038 @cindex future work, autodetection 18747 @cindex future work, autodetection
17039 @cindex autodetection, future work 18748 @cindex autodetection, future work
17040 18749
17041 There are various proposals contained here. 18750 There are various proposals contained here.
17042 18751
17043 @subsection New Implementation of Autodetection Mechanism 18752 @subheading New Implementation of Autodetection Mechanism
18753
18754 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17044 18755
17045 The current auto detection mechanism in XEmacs Mule has many 18756 The current auto detection mechanism in XEmacs Mule has many
17046 problems. For one thing, it is wrong too much of the time. Another 18757 problems. For one thing, it is wrong too much of the time. Another
17047 problem, although easily fixed, is that priority lists are fixed rather 18758 problem, although easily fixed, is that priority lists are fixed rather
17048 than varying, depending on the particular locale; and finally, it 18759 than varying, depending on the particular locale; and finally, it
17181 As part of the "are you sure" dialog box or question, the user can 18892 As part of the "are you sure" dialog box or question, the user can
17182 display the results of the decoding to make sure it's correct. If the 18893 display the results of the decoding to make sure it's correct. If the
17183 user says "no, they're not sure," then the same list of choices as 18894 user says "no, they're not sure," then the same list of choices as
17184 previously mentioned will be presented. 18895 previously mentioned will be presented.
17185 18896
17186 @subheading Implementation of Coding System Priority Lists in Various Locales 18897 @subheading RFC: Autodetection
18898
18899 Also appeared under heading "Implementation of Coding System Priority
18900 Lists in Various Locales" ?
18901
18902 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull}
18903
18904 Date: 11/1/1999 2:48 AM
17187 18905
17188 @example 18906 @example
18907 >>>>> "Hrvoje" == Hrvoje Niksic <hniksic@@srce.hr> writes:
18908
18909 [Ben sez:]
18910
18911 >> You are perfectly free to set up your XEmacs like this, but
18912 >> XEmacs/Mule @strong{will} autodetect by default if there is no
18913 >> Content-Type: info and no reason to believe we are dealing with
18914 >> binary files.
18915
18916 Hrvoje> In that case, it will be a serious mistake to make
18917 Hrvoje> --with-mule the default, ever. I think more care should
18918 Hrvoje> be shown in meeting the need of European users.
18919 @end example
18920
18921 Hrvoje, I don't understand what you are worrying about. I suspect you
18922 are worrying about Handa's hyperactive and obstinate Mule, not what
18923 Ben has in mind. Yes, Ben has said "better guessing," but that's
18924 simply not reasonable without substantial language environment
18925 information. I think trying to detect Latin-1 vs Latin-2 in the POSIX
18926 locale would be a big mistake, I think trying to guess Big 5 v. Shift
18927 JIS in a European locale would be a big mistake.
18928
18929 If Ben doesn't mean "more appropriate use of language environment
18930 information" when he writes "better guessing," I, as much as you, want
18931 to see how he plans to do that. Ben? ("Yes/no/oops I need to think
18932 about it" is good enough if you have specifics you intend to put in
18933 the RFC you're planning to present.)
18934
18935 Let me give a formal proposal of what I would like to see in the
18936 autodetection specification.
18937
17189 @enumerate 18938 @enumerate
18939 @item
18940 Definitions
18941
18942 @enumerate
18943 @item
18944 @dfn{Autodetection} means detecting and making available to Mule
18945 the external file's encoding. See (5), below. It doesn't
18946 imply any specific actions based on that information.
18947
18948 @item
18949 The @dfn{default} case is POSIX locale, and no environment
18950 information in ~/.emacs.
18951
18952 N.B. This @strong{will} cause breakage for all 1-byte users because
18953 the default case can no longer assume Latin-1. You @strong{may} be
18954 able to use the TTY font or the Xt -font option to fake this,
18955 and default to iso8859-1; I would hope that we would not use
18956 such a kludge in the beta versions, although it might be
18957 satisfactory for general use. In particular, encodings like
18958 VISCII (Vietnamese) and I believe KOI-8 (Cyrillic) are not
18959 ISO-2022-clean, but using C1 control characters as a heuristic
18960 for detecting binary files is useful.
18961
18962 If we do allow it, I think that XEmacs should bitch and warn
18963 that the practices of implicitly specifying language
18964 environment by -font and defaulting on TTYs is deprecated and
18965 likely to be obsoleted.
18966
18967 @item
18968 The @dfn{European} case is any Latin-* locale, either implied by
18969 setlocale() and friends or set in ~/.emacs. Latin-1 is
18970 specifically not given precedence over other Latin-*, or
18971 non-Latin or non-ISO-8859 for that matter. I suspect but am
18972 not sure that this case extends to all ISO-8859 encodings, and
18973 possibly to non-ISO-8859 single-byte encodings like KOI-8r (in
18974 particular when combined in a class with ISO-8859 encodings).
18975
18976 @item
18977 The @dfn{CJK} case is any CJK locale. Japanese is specifically
18978 not given precedence over other Asian locales.
18979
18980 @item
18981 For completeness, define the @dfn{Unicode} case (Unicode
18982 unfortunately has lots of junk such as precomposed characters,
18983 language tags, and directionality indicators in it; we
18984 probably don't care yet, but we should also not claim
18985 compliance) and the @dfn{general} case (which has a lot of
18986 features similar to Unicode, but lacks the advantage of a
18987 unified encoding). This proposal has no idea how to handle
18988 the special features of these, or even if that matters. The
18989 general case includes stuff that nobody here really knows how
18990 it works, like Tibetan and Ethiopic.
18991 @end enumerate
18992
18993 Each of the following cases is given in the order of priority of
18994 detection. I'm not sure I'm serious about the top priority given the
18995 (optional) Unicode detection. This may be appropriate if Ben is
18996 right that ISO-2022 is going to disappear, but possibly not until then
18997 (two two-byte sequences out of 65536 is probably 1.99 too many). It
18998 probably isn't too risky if (6)(c) is taken pretty seriously; a Unicode
18999 file should contain _no_ private use characters unless the encoding is
19000 explicitly specified, and that's a block of 1/10 of the code space,
19001 which should help a lot in detecting binary files.
19002
17190 @item 19003 @item
17191 Default locale 19004 Default locale
17192 19005
17193 @enumerate 19006 @enumerate
17194 @item 19007 @item
17291 Newlines will be detected in text files. 19104 Newlines will be detected in text files.
17292 @end enumerate 19105 @end enumerate
17293 19106
17294 @item 19107 @item
17295 Unicode and general locales; multilingual use 19108 Unicode and general locales; multilingual use
17296 @end enumerate
17297 19109
17298 @enumerate 19110 @enumerate
17299 @item 19111 @item
17300 Hopefully a system general enough to handle (2)--(4) will 19112 Hopefully a system general enough to handle (2)--(4) will
17301 handle these, too, but we should watch out for gotchas like 19113 handle these, too, but we should watch out for gotchas like
17311 would involve (eg) heuristics like picking a set of code 19123 would involve (eg) heuristics like picking a set of code
17312 points that are frequent in Shift JIS and uncommon in Big 5 19124 points that are frequent in Shift JIS and uncommon in Big 5
17313 and betting that a file containing many characters from that 19125 and betting that a file containing many characters from that
17314 set is Shift JIS. 19126 set is Shift JIS.
17315 @end enumerate 19127 @end enumerate
17316 @end example 19128
19129 @item
19130 Relationship to decoding semantics
19131
19132 @enumerate
19133 @item
19134 Autodetection should be run on every input stream unless the
19135 user explicitly disables it.
19136
19137 @item
19138 The (conceptual) default procedure is
19139
19140 @item
19141 Read the file into the buffer
19142
19143 Announce the result of autodetection to the user.
19144
19145 User may request decoding, with autodetected encoding(s)
19146 given priority in a list of available encodings.
19147
19148 Optimizations (see (e) below) should avoid introducing data
19149 corruption that this default procedure would avoid.
19150
19151 Obviously, it can't be perfect if any autodecoding is done;
19152 users like Hrvoje should have an easily available option to
19153 revert to this default (or an optimized approximation which
19154 doesn't actually read the whole file into a buffer) or simply
19155 display everything as binary (with the "font" for binary files
19156 being a user option).
19157
19158 @item
19159 This implies that we should be detecting conditions in the
19160 tail of the file which violate the implicit assumptions of the
19161 coding system autodetected (eg, in UTF-8, illegal UTF-8
19162 sequences, including those corresponding to surrogates). These should
19163 raise a warning; the buffer should probably be made read-only
19164 and the user prompted.
19165
19166 This could be taken to extremes, like checking by table
19167 whether all characters in a Japanese file are actually
19168 legitimate JIS codes; that's insane (and would cause corporate
19169 encodings to be recognized as binary). But we should think
19170 about the idea that autodetection shouldn't mean XEmacs can't
19171 change its mind.
19172
19173 @item
19174 A flexible means for the user to delegate the decision
19175 (conditional on the result of autodetection) to decode or not
19176 to XEmacs or a Lisp program should be provided (eg, the
19177 coding priority list and/or a file-coding-alist).
19178
19179 @item
19180 Optimized operations (eg, the current lstreams) should be
19181 provided, with the recognition that if they depend on sampling
19182 the file they are risky.
19183
19184 @item
19185 Mule should provide a reasonable set of default delegations
19186 (as in (d) above) for as many locales as possible.
19187 @end enumerate
19188
19189 @item
19190 Implementation
19191
19192 @enumerate
19193 @item
19194 I think all the decision logic suggested above can be
19195 accomplished through a coding-priority-list and appropriate
19196 initializations for different language environments, and a
19197 file-coding-alist.
19198
19199 @item
19200 Many of the tests on the file's tail shouldn't be very
19201 expensive; in particular, all of the ones I've suggested are
19202 O(n) although they might involve moderate-sized auxiliary
19203 tables for efficiency (eg, 64kB for a single Unicode-oriented
19204 test).
19205 @end enumerate
19206 @end enumerate
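
The tail checks proposed above can be sketched roughly as follows. This is only an illustration in Python used as executable pseudo-code, not XEmacs code: the name @code{check_utf8_tail} and the table of private-use ranges are assumptions made for the example. It makes one O(n) pass, reports the first byte offset at which the autodetected UTF-8 assumption breaks (illegal sequences, including encoded surrogates), and also flags private-use characters, which the text above suggests treating as suspicious unless the encoding was given explicitly.

```python
# Illustrative sketch only; names are hypothetical, not part of XEmacs.

PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # plane 15 private use
    (0x100000, 0x10FFFD),  # plane 16 private use
)

def check_utf8_tail(data, start=0):
    """Return (byte_offset, reason) for the first violation at or after
    `start`, or None if the rest of `data` looks like clean UTF-8."""
    try:
        # Strict decoding rejects overlong forms and encoded surrogates.
        text = data[start:].decode("utf-8")
    except UnicodeDecodeError as err:
        return (start + err.start, "illegal UTF-8 sequence")
    offset = start
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES):
            return (offset, "private-use character")
        offset += len(ch.encode("utf-8"))
    return None
```

A caller that gets a non-@code{None} result would be the place to raise the warning, make the buffer read-only and prompt the user, as suggested in (c).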
19207
19208 Other comments:
19209
19210 It might be reasonable given Hrvoje's objections to require that any
19211 autodetection that could cause data loss (any coding system that
19212 involves escape sequences, and only those AFAIK: by design translation
19213 to Unicode is invertible) by default prompt the user (presumably with
19214 a novice-like ability to retain the prompt, always default to binary,
19215 or always default to the autodetected encoding) in the future, at
19216 least in locales that don't need it (POSIX, Latin-any).
19217
19218 Ben thinks that we can remember the input data; I think it's going to
19219 be hard to comprehensively test that a highly optimized version works.
19220 Good design will help, but ISO-2022 is enormously complex, and there
19221 are many encodings that violate even its lax assumptions. On the
19222 other hand, memory is the only way to get non-rewindable streams right.
19223
19224 Hrvoje himself said he would like to have an XEmacs that distinguishes
19225 between Latin-1 and Latin-2 text. Where it is possible to do that,
19226 this is exactly what autodetection of ISO-2022 and Unicode gives you.
19227 Many people would want that, even at some risk of binary corruption.
19228
19229 >> Once again I remind you that XEmacs is a @strong{text} editor. There
19230 >> are lots of files that potentially may have Japanese etc. in
19231 >> them without this marked, e.g. C or Elisp files in the XEmacs
19232 >> source. Surely you're not arguing that we interpret even these
19233 >> files as binary by default?
19234
19235 Hrvoje> I am. If I want to see Japanese, I'll setup my
19236 Hrvoje> environment that way. But I don't, and neither do 99% of
19237 Hrvoje> Croatian users. I can't speak for French, Italian, and
19238 Hrvoje> others, but I'd assume similar.
19239
19240 Hrvoje> If there is Japanese in the source files, I will see it as
19241 Hrvoje> escape sequences, which is perfectly fine, because I don't
19242 Hrvoje> read Japanese.
19243
19244 And some (European) people will have their terminals scrambled,
19245 because Shift-JIS contains sequences that can change the state of
19246 XTerm (as do fixed-width Unicode and Big5). This may also be a
19247 problem with some Windows-12xx encodings; I'm not sure they all are
19248 ISO-2022-clean. (This isn't a problem for XEmacs native X11 frames or
19249 native MS-Windows frames, and the XEmacs sources themselves are all in
19250 7-bit ISO-2022 now IIRC. But it is a potential source of great
19251 frustration for many users.)
19252
19253 I think that should be considered too, although it is presumably lower
19254 priority than the data corruption of binary files.
19255
19256 @subheading Response to RFC: Autodetection
19257
19258 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
19259
19260 Date: 11/1/1999 7:24 AM
19261
19262 Stephen, thank you very much for writing this up. I think it is a good start,
19263 and definitely moving in the direction I would like to see things going: more
19264 proposals, less arguing. (aka "more light, less heat") However, I have some
19265 suggestions for cleaning this up:
19266
19267 You should try to make it more layered. For example, you might have one
19268 section devoted to the workings of autodetection, which starts out like this
19269 (the section numbers below are totally arbitrary):
19270
19271 @subsubheading Section 5
19272
19273 @code{Autodetect()} is a function whose arguments are (1) a readable stream, (2) some
19274 hints indicating how the autodetection is to proceed, and (3) a value
19275 indicating the maximum number of characters to examine at the beginning of the
19276 stream. (Possibly, the value in (3) may be some special symbol indicating
19277 that we only go as far as the next line, or a certain number of lines ahead;
19278 this would be used as part of "continuous autodetection", e.g. we are decoding
19279 the results of an interactive terminal session, where the user may
19280 periodically switch encodings, line terminations, etc. as different programs
19281 get run and/or telnet or similar sessions are entered into and exited.) We
19282 assume the stream is rewindable; if not, insert a "rewinding" stream in front
19283 of the non-rewinding stream; this kind of stream automatically buffers the
19284 data as necessary.
19285 [You can use pseudo-code terminology here. No need for straight C or ELisp.]
19286 [Then proceed to describe what the hints look like -- e.g. you could portray
19287 it as a property list or whatever. The idea is that, for each locale, there
19288 is a corresponding hints value that is used at least by default. The hints
19289 structure also has to be set up to allow for two or more competing hints
19290 specifications to be merged together. For example, the extension of a file
19291 might provide an additional hint or hints about how to interpret the data of
19292 that file, and the caller of @code{autodetect()}, when calling @code{autodetect()} on such a
19293 file, would need to have a way of gracefully merging the default hints
19294 corresponding to the locale with the more specific hints provided by the
19295 extension. Furthermore, users like Hrvoje might well want to provide their
19296 own hints to supplement and override parts of the generic hints -- e.g. "I
19297 don't ever want to see non-European encodings decoded; treat them as binary
19298 instead".]
19299 [Then describe algorithmically how the autodetection works. First, you could
19300 describe it more generally, i.e. presenting an algorithmic overview, then you
19301 could discuss in detail exactly how autodetection of a particular type of
19302 external encoding works -- e.g. "for iso2022, we first look for an escape
19303 character, followed by a byte in this range [. ... .] etc."]
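
[As a rough illustration of the two supporting pieces described in this section, here is a sketch in Python used as pseudo-code. Everything in it is an assumption made for the example: the class name @code{RewindingStream}, the function @code{merge_hints}, and the hint keys are hypothetical, and a real hints structure would likely need a richer merge than "last one wins".]

```python
import io

class RewindingStream:
    """Wrap a non-rewindable stream, buffering whatever has been read
    so that autodetection can look ahead and then start over."""

    def __init__(self, raw):
        self._raw = raw
        self._buffer = bytearray()
        self._pos = 0

    def read(self, n):
        # Pull from the underlying stream only when the buffer runs short.
        missing = self._pos + n - len(self._buffer)
        if missing > 0:
            self._buffer += self._raw.read(missing)
        chunk = bytes(self._buffer[self._pos:self._pos + n])
        self._pos += len(chunk)
        return chunk

    def rewind(self):
        self._pos = 0


def merge_hints(*hint_sets):
    """Merge hint dictionaries; later (more specific) sets override
    earlier (more generic) ones key by key."""
    merged = {}
    for hints in hint_sets:
        merged.update(hints)
    return merged
```

[With this, a caller might combine @code{merge_hints(locale_hints, extension_hints, user_hints)} and pass the result, together with a @code{RewindingStream} and a character limit, to @code{autodetect()}.]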
19304
19305 @subsubheading Section 6
19306
19307 This section describes the concept of a locale in XEmacs, and how it is
19308 derived from the user's environment. A locale in XEmacs is a pair, a country
19309 and a language, together determining the handling of locale-specific areas of
19310 XEmacs. All locale-specific areas in XEmacs make use of this XEmacs locale,
19311 and do not attempt to derive the locale from any other sources. The user is
19312 free to change the current locale at any time; accessor and mutator functions
19313 are provided to do this so that various locale-specific areas can optionally
19314 be changed together with it.
19315
19316 [Then you describe how the XEmacs locale is extracted from .emacs, from
19317 @code{setlocale()}, from the LANG environment variables, from -font, or wherever
19318 else. All other sections assume this dirty work is done and never even
19319 mention it]
19320
19321 @subsubheading Section 7
19322
19323 [Here you describe the default @code{autodetect()} hints value corresponding to each
19324 possible locale. You should probably use a schematic description here, e.g.
19325 an actual Lisp property list, liberally commented.]
19326
19327 @subsubheading Section 8 etc.
19328
19329 [Other sections cover anything I've missed. By being very careful to separate
19330 out the layers, you simultaneously introduce more rigor (easier to catch bugs)
19331 and make it easier for someone else to understand it completely.]
17317 19332
17318 @subheading Better Algorithm, More Flexibility, Different Levels of Certainty 19333 @subheading Better Algorithm, More Flexibility, Different Levels of Certainty
17319 19334
17320 @subheading Much More Flexible Coding System Priority List, per-Language Environment 19335 @subheading Much More Flexible Coding System Priority List, per-Language Environment
17321 19336
17322 @subheading User Ability to Select Encoding when System Unsure or Encounters Errors 19337 @subheading User Ability to Select Encoding when System Unsure or Encounters Errors
17323 19338
17324 @subheading Another Autodetection Proposal 19339 @subheading Another Autodetection Proposal
19340
19341 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17325 19342
17326 however, in general the detection code has major problems and needs lots 19343 however, in general the detection code has major problems and needs lots
17327 of work: 19344 of work:
17328 19345
17329 @itemize @bullet 19346 @itemize @bullet
17330 @item 19347 @item
17331 instead of merely "yes" or "no" for particular categories, we need a 19348 instead of merely "yes" or "no" for particular categories, we need a
17332 more flexible system, with various levels of likelihood. Currently 19349 more flexible system, with various levels of likelihood. Currently
17333 I've created a system with six levels, as follows: 19350 I've created a system with six levels, as follows:
17334 19351
17335 [see file-coding.h] 19352 [see @file{file-coding.h}]
17336 19353
17337 Let's consider what this might mean for an ASCII text detector. (In 19354 Let's consider what this might mean for an ASCII text detector. (In
17338 order to have accurate detection, especially given the iteration I 19355 order to have accurate detection, especially given the iteration I
17339 proposed below, we need active detectors for @strong{all} types of data we 19356 proposed below, we need active detectors for @strong{all} types of data we
17340 might reasonably encounter, such as ASCII text files, binary files, 19357 might reasonably encounter, such as ASCII text files, binary files,
17499 19516
17500 ben [at least that's what sjt thinks] 19517 ben [at least that's what sjt thinks]
17501 19518
17502 ***** 19519 *****
17503 19520
19521 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull}
19522
17504 While this is clearly something of an improvement over earlier designs, 19523 While this is clearly something of an improvement over earlier designs,
17505 it doesn't deal with the most important issue: to do better than categories 19524 it doesn't deal with the most important issue: to do better than categories
17506 (which in the medium term is mostly going to mean "which flavor of Unicode 19525 (which in the medium term is mostly going to mean "which flavor of Unicode
17507 is this?"), we need to look at statistical behavior rather than ruling out 19526 is this?"), we need to look at statistical behavior rather than ruling out
17508 categories via presence of specific sequences. This means the stream 19527 categories via presence of specific sequences. This means the stream
17525 and "magic" like Unicode signatures or file(1) magic. 19544 and "magic" like Unicode signatures or file(1) magic.
17526 @end enumerate 19545 @end enumerate
17527 19546
17528 --sjt 19547 --sjt
17529 19548
17530 @node Future Work -- Conversion Error Detection, Future Work -- BIDI Support, Future Work -- Autodetection, Future Work -- Byte Code Snippets 19549 @node Future Work -- Conversion Error Detection, Future Work -- Unicode, Future Work -- Autodetection, Future Work -- Byte Code Snippets
17531 @subsection Future Work -- Conversion Error Detection 19550 @subsection Future Work -- Conversion Error Detection
17532 @cindex future work, conversion error detection 19551 @cindex future work, conversion error detection
17533 @cindex conversion error detection, future work 19552 @cindex conversion error detection, future work
17534 19553
17535 @subheading "No Corruption" Scheme for Preserving External Encoding when Non-Invertible Transformation Applied 19554 @subheading "No Corruption" Scheme for Preserving External Encoding when Non-Invertible Transformation Applied
19555
19556 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17536 19557
17537 A preliminary and simple implementation is: 19558 A preliminary and simple implementation is:
17538 19559
17539 @quotation 19560 @quotation
17540 But you could implement it much more simply and usefully by just 19561 But you could implement it much more simply and usefully by just
17599 correspondences to get the internal state right. 19620 correspondences to get the internal state right.
17600 @end enumerate 19621 @end enumerate
17601 @end quotation 19622 @end quotation
17602 19623
17603 @subheading Another Error-Catching Idea 19624 @subheading Another Error-Catching Idea
19625
19626 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17604 19627
17605 Nov 4, 1999 19628 Nov 4, 1999
17606 19629
17607 Finally, I don't think "save the input" is as hard as you make it out to 19630 Finally, I don't think "save the input" is as hard as you make it out to
17608 be. Conceptually, in fact, it's simple: for each minimal group of bytes 19631 be. Conceptually, in fact, it's simple: for each minimal group of bytes
17619 cases. The hardest part, in fact, is making all the string/text 19642 cases. The hardest part, in fact, is making all the string/text
17620 handling in XEmacs be robust w.r.t. text properties. 19643 handling in XEmacs be robust w.r.t. text properties.
17621 19644
17622 @subheading Strategies for Error Annotation and Coding Orthogonalization 19645 @subheading Strategies for Error Annotation and Coding Orthogonalization
17623 19646
17624 From sjt (?): 19647 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull}
17625 19648
17626 We really want to separate out a number of things. Conceptually, 19649 We really want to separate out a number of things. Conceptually,
17627 there is a nested syntax. 19650 there is a nested syntax.
17628 19651
17629 At the top level is the ISO 2022 extension syntax, including charset 19652 At the top level is the ISO 2022 extension syntax, including charset
17660 It's possible that, by doing the processing with tables of functions or 19683 It's possible that, by doing the processing with tables of functions or
17661 the like, the parser can be used for both detection and translation. 19684 the like, the parser can be used for both detection and translation.
17662 19685
17663 @subheading Handling Writing a File Safely, Without Data Loss 19686 @subheading Handling Writing a File Safely, Without Data Loss
17664 19687
17665 From ben: 19688 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17666 19689
17667 @quotation 19690 @quotation
17668 When writing a file, we need error detection; otherwise somebody 19691 When writing a file, we need error detection; otherwise somebody
17669 will create a Unicode file without realizing the coding system 19692 will create a Unicode file without realizing the coding system
17670 of the buffer is Raw, and then lose all the non-ASCII/Latin-1 19693 of the buffer is Raw, and then lose all the non-ASCII/Latin-1
17715 same thing (error checking, list of alternatives, etc.) needs 19738 same thing (error checking, list of alternatives, etc.) needs
17716 to happen when reading! all of this will be a lot of work! 19739 to happen when reading! all of this will be a lot of work!
17717 @end enumerate 19740 @end enumerate
17718 @end quotation 19741 @end quotation
17719 19742
17720 --ben 19743 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull}
17721 19744
17722 I don't much like Ben's scheme. First, this isn't an issue of I/O, 19745 I don't much like Ben's scheme. First, this isn't an issue of I/O,
17723 it's a coding issue. It can happen in many places, not just on stream 19746 it's a coding issue. It can happen in many places, not just on stream
17724 I/O. Error checking should take place on all translations. Second, 19747 I/O. Error checking should take place on all translations. Second,
17725 the two-pass algorithm should be avoided if possible. In some cases 19748 the two-pass algorithm should be avoided if possible. In some cases
17747 characters. So (up to some maximum) we should keep a list of unsafe 19770 characters. So (up to some maximum) we should keep a list of unsafe
17748 text positions, and provide a convenient function for traversing them. 19771 text positions, and provide a convenient function for traversing them.
17749 19772
17750 --sjt 19773 --sjt
17751 19774
17752 @node Future Work -- BIDI Support, Future Work -- Localized Text/Messages, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets 19775 @node Future Work -- Unicode, Future Work -- BIDI Support, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets
19776 @subsection Future Work -- Unicode
19777 @cindex future work, unicode
19778 @cindex unicode, future work
19779
19780 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
19781
19782 Following is an old proposal. Unicode has been implemented already, in
19783 a different fashion; but there are some ideas here for more general
19784 support, e.g. properties of Unicode characters other than their mappings
19785 to particular charsets.
19786
19787
19788 We recognize 128, [256], 128x128, [256x256] for source charsets;
19789
19790 for Unicode, 256x256 or 16x256x256.
19791
19792 In all cases, use tables of tables and substitute a default subtable
19793 if entire row is empty.
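
A minimal sketch of the "tables of tables" idea, in Python used as pseudo-code. The names @code{TwoLevelTable}, @code{DEFAULT_ROW} and @code{UNMAPPED} are hypothetical, and the 128x128 size is just the case mentioned for source charsets; the point is only that every empty row shares one default subtable and a real row is allocated on first write.

```python
UNMAPPED = -1
ROW_SIZE = 128

# One shared, immutable, all-empty row stands in for every empty row.
DEFAULT_ROW = (UNMAPPED,) * ROW_SIZE

class TwoLevelTable:
    def __init__(self):
        self.rows = [DEFAULT_ROW] * ROW_SIZE

    def set(self, high, low, value):
        if self.rows[high] is DEFAULT_ROW:
            # Allocate a private row only when it first receives a mapping.
            self.rows[high] = [UNMAPPED] * ROW_SIZE
        self.rows[high][low] = value

    def get(self, high, low):
        return self.rows[high][low]

    def allocated_rows(self):
        return sum(1 for row in self.rows if row is not DEFAULT_ROW)
```

Lookups never need to test for a missing row, since the default subtable answers @code{UNMAPPED} for every column.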
19794
19795 If destination is Unicode, either 16 or 32 bits.
19796
19797 If destination is charset, either 8 or 16 bits.
19798
19799 For the moment, since we only do 94, 96, 94x94 or 96x96, only do 128
19800 or 128x128 for source charsets and use the range 33-126 or 32-127.
19801 (Except ASCII - we special case that and have no table because we can
19802 algorithmically translate)
19803
19804 Also have a 16x256x256 table -> 32 bits of Unicode char properties.
19805
19806 A particular charset contains two associated mapping tables, for both
19807 directions.
19808
19809 API is set-unicode-mapping:
19810
19811 @example
19812 (set-unicode-mapping
19813 unicode char
19814 unicode charset-code charset-offset
19815 unicode vector of char
19816 unicode list of char
19817 unicode string of char
19818 unicode vector or list of codes charset-offset
19819 @end example
19820
19821 Establishes a mapping between a unicode codepoint (an integer) and
19822 one or more chars in a charset. The mapping is automatically
19823 established in both directions. Chars in a charset can be specified
19824 either with an actual character or a codepoint (i.e. an integer)
19825 and the charset it's within. If a sequence of chars or charset
19826 points is given, multiple mappings are established for consecutive
19827 unicode codepoints starting with the given one. Charset codepoints
19828 are specified as most-significant x 256 + least significant, with
19829 both bytes in the range 33-126 (for 94 or 94x94) or 32-127 (for 96
19830 or 96x96), unless an offset is given, which will be subtracted from
19831 each byte. (Most common values are 128, for codepoints given with
19832 the high bit set, or -32, for codepoints given as 1-94 or 0-95.)
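
The codepoint convention just described can be illustrated as follows (Python used as pseudo-code; @code{split_charset_code} and its parameter names are hypothetical and not part of the proposed Lisp API). The function splits a charset codepoint into bytes, subtracts the offset from each byte, and checks the result against the 33-126 or 32-127 range.

```python
def split_charset_code(code, dimension, chars, offset=0):
    """Split a charset codepoint into its bytes, subtract `offset` from
    each byte, and validate the result.

    dimension: 1 for 94/96 charsets, 2 for 94x94/96x96 charsets.
    chars:     94 or 96.
    """
    low, high = (33, 126) if chars == 94 else (32, 127)
    if dimension == 1:
        raw = [code]
    else:
        raw = [code >> 8, code & 0xFF]   # most significant byte first
    result = tuple(b - offset for b in raw)
    for b in result:
        if not low <= b <= high:
            raise ValueError("byte %d outside %d-%d" % (b, low, high))
    return result
```

For example, a codepoint given with the high bit set, such as 0xB0A1 with offset 128, and the same position given as 0x3021 with no offset both yield the bytes 0x30 and 0x21.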
19833
19834 Other API's:
19835
19836 @example
19837 (write-unicode-mapping file charset)
19838 @end example
19839
19840 Write the mapping table for a particular charset to the specified
19841 file. The tables are written in an internal format that allows for
19842 efficient loading, for portability across platforms and XEmacs
19843 invocations, for conserving space, for appending multiple tables one
19844 directly after another with no need for a directory anywhere in the
19845 file, and for recognizing a file as in this format (with a magic
19846 sequence at the beginning). The data will be appended at the end of
19847 a file, so that multiple tables can be written to a file; remove the
19848 file first to avoid this.
19849
19850 @example
19851 (write-unicode-properties file unicode-codepoint length)
19852 @end example
19853
19854 Write the Unicode properties (not including charset mappings) for
19855 the specified range of contiguous Unicode codepoints to the end of
19856 the file (i.e. append mode) in a binary format similar to what was
19857 mentioned in the write-unicode-mapping description and with the same
19858 features.
19859
19860 Extension to set-unicode-mapping:
19861
19862 @example
19863 (set-unicode-mapping
19864 list-or-vector-of-unicode-codepoints char
19865 "" charset-code charset-offset
19866 "" sequence of char
19867 "" list-or-vector-of-codes
19868 charset-offset
19869 @end example
19870
19871 The first two forms are conceptually the inverse of the forms above
19872 to specify characters for a contiguous range of Unicode codepoints.
19873 These new forms let you specify the Unicode codepoints for a
19874 contiguous range of chars in a charset. "Contiguous" here means
19875 that if we run off the end of a row, we go to the first entry of the
19876 next row, rather than to an invalid code point. For example, in a
19877 94x94 charset, valid rows and columns are in the range 0x21-0x7e;
19878 after 0x457c 0x457d 0x457e goes 0x4621, not something like 0x457f,
19879 which is invalid.
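
The notion of "contiguous" used here amounts to the following successor rule, sketched in Python as pseudo-code. The function name @code{next_charset_code} is hypothetical, and the sketch assumes its input is already a valid two-byte codepoint of the charset.

```python
def next_charset_code(code, chars=94):
    """Return the codepoint following `code` in a 94x94 or 96x96 charset,
    wrapping to the first column of the next row at the end of a row."""
    low, high = (0x21, 0x7E) if chars == 94 else (0x20, 0x7F)
    row, col = code >> 8, code & 0xFF
    if col < high:
        return (row << 8) | (col + 1)
    if row < high:
        return ((row + 1) << 8) | low
    raise ValueError("ran off the end of the charset")
```

So in a 94x94 charset the successor of 0x457e is 0x4621, never 0x457f.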
19880
19881 The final two forms are the most general, letting you specify an
19882 arbitrary set of both Unicode points and charset chars, and the two
19883 are matched up just like a series of individual calls. However, if
19884 the lists or vectors do not have the same length, an error is
19885 signaled.
19886
19887 @example
19888 (load-unicode-mapping file &optional charset)
19889 @end example
19890
19891 If charset is omitted, loads all charset mapping tables found and
19892 returns a list of the charsets found. If charset is specified,
19893 searches through the file for the appropriate mapping tables. (This
19894 is extremely fast because each entry in the file gives an offset to
19895 the next one). Returns t if found.
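
The on-disk format is not specified in this proposal, so the following is purely a hypothetical layout, sketched in Python as pseudo-code, to show why a directory-less file of appended tables can still be searched quickly. The magic bytes, header fields and function names are all assumptions. Each entry starts with a small header giving the length of its name and payload, which is equivalent to an offset to the next entry, so a search can seek past every table it does not want.

```python
import io
import struct

MAGIC = b"XUMT"            # hypothetical magic sequence
HEADER = "<HI"             # name length (2 bytes), payload length (4 bytes)
HEADER_SIZE = struct.calcsize(HEADER)

def append_table(f, charset, payload):
    """Append one mapping table to the end of the file; no directory needed."""
    name = charset.encode("ascii")
    f.seek(0, io.SEEK_END)
    if f.tell() == 0:
        f.write(MAGIC)     # a fresh file gets the magic sequence first
    f.write(struct.pack(HEADER, len(name), len(payload)))
    f.write(name)
    f.write(payload)

def find_table(f, charset):
    """Return the payload for `charset`, or None, skipping other entries."""
    f.seek(0)
    if f.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a mapping file")
    wanted = charset.encode("ascii")
    while True:
        header = f.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            return None    # reached end of file without a match
        name_len, payload_len = struct.unpack(HEADER, header)
        if f.read(name_len) == wanted:
            return f.read(payload_len)
        f.seek(payload_len, io.SEEK_CUR)   # jump straight to the next entry
```

Because appending never rewrites earlier entries, writing the same charset twice leaves both copies in the file, which matches the advice above to remove the file first when that is not wanted.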
19896
19897 @example
19898 (load-unicode-properties file unicode-codepoint)
19899 @end example
19900
19901 @example
19902 (list-unicode-entries file)
19903 @end example
19904
19905 @example
19906 (autoload-unicode-mapping charset)
19907 @end example
19908
19909 ...
19910
19911 (unfinished)
19912
19913 @node Future Work -- BIDI Support, Future Work -- Localized Text/Messages, Future Work -- Unicode, Future Work -- Byte Code Snippets
17753 @subsection Future Work -- BIDI Support 19914 @subsection Future Work -- BIDI Support
17754 @cindex future work, bidi support 19915 @cindex future work, bidi support
17755 @cindex bidi support, future work 19916 @cindex bidi support, future work
19917
19918 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
17756 19919
17757 @enumerate 19920 @enumerate
17758 @item 19921 @item
17759 Use text properties to handle nesting levels, overrides 19922 Use text properties to handle nesting levels, overrides
17760 BIDI-specific text properties (as per Unicode BIDI algorithm) 19923 BIDI-specific text properties (as per Unicode BIDI algorithm)
17812 19975
17813 (much of this comment is outdated, and a lot of it is actually 19976 (much of this comment is outdated, and a lot of it is actually
17814 implemented) 19977 implemented)
17815 19978
17816 @subsection Proposal for How This All Ought to Work 19979 @subsection Proposal for How This All Ought to Work
19980
19981 Author: @uref{mailto:jwz@@jwz.org,Jamie Zawinski}
17817 19982
17818 this isn't implemented yet, but this is the plan-in-progress 19983 this isn't implemented yet, but this is the plan-in-progress
17819 19984
17820 In general, it's accepted that the best way to internationalize is for all 19985 In general, it's accepted that the best way to internationalize is for all
17821 messages to be referred to by a symbolic name (or number) and come out of a 19986 messages to be referred to by a symbolic name (or number) and come out of a
17860 one we know how to translate, then we translate it? I think this is a 20025 one we know how to translate, then we translate it? I think this is a
17861 worthy goal. It remains to be seen how well it will work in practice. 20026 worthy goal. It remains to be seen how well it will work in practice.
17862 20027
17863 So, we should endeavor to minimize the impact on the lisp code. Certain 20028 So, we should endeavor to minimize the impact on the lisp code. Certain
17864 primitive lisp routines (the stuff in lisp/prim/, and especially in 20029 primitive lisp routines (the stuff in lisp/prim/, and especially in
17865 cmdloop.el and minibuf.el) may need to be changed to know about translation, 20030 @file{cmdloop.el} and @file{minibuf.el}) may need to be changed to know about translation,
17866 but that's an ideologically clean thing to do because those are considered 20031 but that's an ideologically clean thing to do because those are considered
17867 a part of the emacs substrate. 20032 a part of the emacs substrate.
17868 20033
17869 However, if we find ourselves wanting to make changes to, say, RMAIL, then 20034 However, if we find ourselves wanting to make changes to, say, RMAIL, then
17870 something has gone wrong. (Except to do things like remove assumptions 20035 something has gone wrong. (Except to do things like remove assumptions
17880 the translation. The new plan is to separate these two things more: the 20045 the translation. The new plan is to separate these two things more: the
17881 tags that we search for to build the catalog will be stuff that was in there 20046 tags that we search for to build the catalog will be stuff that was in there
17882 already, and the translation will get done in some more centralized, lower 20047 already, and the translation will get done in some more centralized, lower
17883 level place. 20048 level place.
17884 20049
17885 This program (make-msgfile.c) addresses the first part, extracting the 20050 This program (@file{make-msgfile.c}) addresses the first part, extracting the
17886 strings. 20051 strings.
17887 20052
17888 For the emacs C code, we need to recognize the following patterns: 20053 For the emacs C code, we need to recognize the following patterns:
17889 20054
17890 @example 20055 @example
17929 20094
17930 I expect there will be a lot like the above; basically, any function which 20095 I expect there will be a lot like the above; basically, any function which
17931 is a commonly used wrapper around an eventual call to @code{message} or 20096 is a commonly used wrapper around an eventual call to @code{message} or
17932 @code{read-from-minibuffer} needs to be recognized by this program. 20097 @code{read-from-minibuffer} needs to be recognized by this program.
17933 20098
17934
17935 @example 20099 @example
17936 (dgettext "domain-name" "string") #### do we still need this? 20100 (dgettext "domain-name" "string") #### do we still need this?
17937 20101
17938 things that should probably be restructured: 20102 things that should probably be restructured:
17939 @code{princ} in cmdloop.el 20103 @code{princ} in @file{cmdloop.el}
17940 @code{insert} in debug.el 20104 @code{insert} in @file{debug.el}
17941 face-interactive 20105 face-interactive
17942 help.el, syntax.el all messed up 20106 @file{help.el}, @file{syntax.el} all messed up
17943 @end example 20107 @end example
17944 20108
20109 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
20110
17945 ben: (format) is a tricky case. If I use format to create a string 20111 ben: (format) is a tricky case. If I use format to create a string
17946 that I then send to a file, I probably don't want the string translated. 20112 that I then send to a file, I probably don't want the string translated.
17947 On the other hand, if the string gets used as an argument to (y-or-n-p) 20113 On the other hand, if the string gets used as an argument to (y-or-n-p)
17948 or some such function, I do want it translated, and it needs to be 20114 or some such function, I do want it translated, and it needs to be
17949 translated before the %s and such are replaced. The proper solution 20115 translated before the %s and such are replaced. The proper solution
18051 We can solve this by adding a bit to Lisp_String objects which identifies 20217 We can solve this by adding a bit to Lisp_String objects which identifies
18052 them as having been read as literal constants from a .el or .elc file (as 20218 them as having been read as literal constants from a .el or .elc file (as
18053 opposed to having been constructed at run time as it would in the above 20219 opposed to having been constructed at run time as it would in the above
18054 case.) To solve this: 20220 case.) To solve this:
18055 20221
18056 @example 20222 @itemize @bullet
18057 - @code{Fmessage()} takes a lisp string as its first argument. 20223 @item
18058 If that string is a constant, that is, was read from a source file 20224 @code{Fmessage()} takes a lisp string as its first argument.
18059 as a literal, then it calls @code{message()} with it, which translates. 20225 If that string is a constant, that is, was read from a source file
18060 Otherwise, it calls @code{message_no_translate()}, which does not translate. 20226 as a literal, then it calls @code{message()} with it, which translates.
18061 20227 Otherwise, it calls @code{message_no_translate()}, which does not translate.
18062 - @code{Ferror()} (actually, @code{Fsignal()} when condition is Qerror) works similarly. 20228
18063 @end example 20229 @item
20230 @code{Ferror()} (actually, @code{Fsignal()} when condition is Qerror) works similarly.
20231 @end itemize
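The dispatch rule in the two items above can be sketched in C. This is only an illustration under stated assumptions: @code{struct sketch_string}, its @code{is_literal} field, and @code{sketch_translate()} are hypothetical stand-ins for @code{Lisp_String}, the proposed literal bit, and the message catalog lookup; none of them are actual XEmacs names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for Lisp_String; only the fields needed here. */
struct sketch_string {
    const char *data;
    int is_literal;   /* proposed bit: set when read as a constant from a .el/.elc file */
};

/* Hypothetical stand-in for a catalog lookup; knows a single message. */
static const char *sketch_translate(const char *msgid)
{
    if (strcmp(msgid, "File %s saved") == 0)
        return "[fr] File %s saved";
    return msgid;   /* no catalog entry: fall back to the original */
}

/* Choose the format string the way the text describes for Fmessage():
   translate a literal constant, pass a run-time string through untouched.
   Translation happens before any %s substitution. */
const char *sketch_message_format(const struct sketch_string *fmt)
{
    assert(fmt != NULL && fmt->data != NULL);
    return fmt->is_literal ? sketch_translate(fmt->data) : fmt->data;
}
```

A string built at run time by @code{format} would carry a clear bit, so it would reach the user exactly as constructed.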
18064 20232
18065 More specifically, we do: 20233 More specifically, we do:
18066 20234
18067 @quotation 20235 @quotation
18068 Scan specified C and Lisp files, extracting the following messages: 20236 Scan specified C and Lisp files, extracting the following messages:
18100 it might run into problems if Arg is used for other sorts 20268 it might run into problems if Arg is used for other sorts
18101 of functions. 20269 of functions.
18102 @item 20270 @item
18103 @code{snarf()} should be modified so that it doesn't output null 20271 @code{snarf()} should be modified so that it doesn't output null
18104 strings and non-textual strings (see the comment at the top 20272 strings and non-textual strings (see the comment at the top
18105 of make-msgfile.c). 20273 of @file{make-msgfile.c}).
18106 @item 20274 @item
18107 parsing of (insert) should snarf all of the arguments. 20275 parsing of (insert) should snarf all of the arguments.
18108 @item 20276 @item
18109 need to add set-keymap-prompt and deal with gettext of that. 20277 need to add set-keymap-prompt and deal with gettext of that.
18110 @item 20278 @item
18139 20307
18140 @node Future Work -- Lisp Stream API, Future Work -- Multiple Values, Future Work -- Byte Code Snippets, Future Work 20308 @node Future Work -- Lisp Stream API, Future Work -- Multiple Values, Future Work -- Byte Code Snippets, Future Work
18141 @section Future Work -- Lisp Stream API 20309 @section Future Work -- Lisp Stream API
18142 @cindex future work, Lisp stream API 20310 @cindex future work, Lisp stream API
18143 @cindex Lisp stream API, future work 20311 @cindex Lisp stream API, future work
20312
20313 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18144 20314
18145 Expose XEmacs internal lstreams to Lisp as stream objects. (In 20315 Expose XEmacs internal lstreams to Lisp as stream objects. (In
18146 addition to the functions given below, each stream object has 20316 addition to the functions given below, each stream object has
18147 properties that can be associated with it using the standard put, get 20317 properties that can be associated with it using the standard put, get
18148 etc. API. For GNU Emacs, where put and get have not been extended to 20318 etc. API. For GNU Emacs, where put and get have not been extended to
18530 @node Future Work -- Multiple Values, Future Work -- Macros, Future Work -- Lisp Stream API, Future Work 20700 @node Future Work -- Multiple Values, Future Work -- Macros, Future Work -- Lisp Stream API, Future Work
18531 @section Future Work -- Multiple Values 20701 @section Future Work -- Multiple Values
18532 @cindex future work, multiple values 20702 @cindex future work, multiple values
18533 @cindex multiple values, future work 20703 @cindex multiple values, future work
18534 20704
20705 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
20706
18535 On low level, all funs that can return multiple values are defined 20707 On low level, all funs that can return multiple values are defined
18536 with DEFUN_MULTIPLE_VALUES and have an extra parameter, a struct 20708 with DEFUN_MULTIPLE_VALUES and have an extra parameter, a struct
18537 mv_context *. 20709 mv_context *.
18538 20710
18539 It has to be this way to ensure that only the fun itself, and no called 20711 It has to be this way to ensure that only the fun itself, and no called
18574 20746
18575 @node Future Work -- Macros, Future Work -- Specifiers, Future Work -- Multiple Values, Future Work 20747 @node Future Work -- Macros, Future Work -- Specifiers, Future Work -- Multiple Values, Future Work
18576 @section Future Work -- Macros 20748 @section Future Work -- Macros
18577 @cindex future work, macros 20749 @cindex future work, macros
18578 @cindex macros, future work 20750 @cindex macros, future work
20751
20752 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18579 20753
18580 @enumerate 20754 @enumerate
18581 @item 20755 @item
18582 Option to control whether beep really kills a macro execution. 20756 Option to control whether beep really kills a macro execution.
18583 @item 20757 @item
18592 20766
18593 @node Future Work -- Specifiers, Future Work -- Display Tables, Future Work -- Macros, Future Work 20767 @node Future Work -- Specifiers, Future Work -- Display Tables, Future Work -- Macros, Future Work
18594 @section Future Work -- Specifiers 20768 @section Future Work -- Specifiers
18595 @cindex future work, specifiers 20769 @cindex future work, specifiers
18596 @cindex specifiers, future work 20770 @cindex specifiers, future work
20771
20772 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18597 20773
18598 @subheading Ideas To Work On When Their Time Has Come 20774 @subheading Ideas To Work On When Their Time Has Come
18599 20775
18600 @itemize 20776 @itemize
18601 @item 20777 @item
18799 @node Future Work -- Display Tables, Future Work -- Making Elisp Function Calls Faster, Future Work -- Specifiers, Future Work 20975 @node Future Work -- Display Tables, Future Work -- Making Elisp Function Calls Faster, Future Work -- Specifiers, Future Work
18800 @section Future Work -- Display Tables 20976 @section Future Work -- Display Tables
18801 @cindex future work, display tables 20977 @cindex future work, display tables
18802 @cindex display tables, future work 20978 @cindex display tables, future work
18803 20979
20980 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
20981
18804 #### It would also be really nice if you could specify that the 20982 #### It would also be really nice if you could specify that the
18805 characters come out in hex instead of in octal. Mule does that by 20983 characters come out in hex instead of in octal. Mule does that by
18806 adding a @code{ctl-hexa} variable similar to @code{ctl-arrow}, but 20984 adding a @code{ctl-hexa} variable similar to @code{ctl-arrow}, but
18807 that's bogus -- we need a more general solution. I think you need to 20985 that's bogus -- we need a more general solution. I think you need to
18808 extend the concept of display tables into a more general conversion 20986 extend the concept of display tables into a more general conversion
18841 @end example 21019 @end example
18842 21020
18843 Since more than one display table is possible, you have 21021 Since more than one display table is possible, you have
18844 great flexibility in mapping ranges of characters. 21022 great flexibility in mapping ranges of characters.
18845 21023
18846 @uref{../../www.666.com/ben/default.htm,Ben Wing}
18847
18848 @node Future Work -- Making Elisp Function Calls Faster, Future Work -- Lisp Engine Replacement, Future Work -- Display Tables, Future Work 21024 @node Future Work -- Making Elisp Function Calls Faster, Future Work -- Lisp Engine Replacement, Future Work -- Display Tables, Future Work
18849 @section Future Work -- Making Elisp Function Calls Faster 21025 @section Future Work -- Making Elisp Function Calls Faster
18850 @cindex future work, making Elisp function calls faster 21026 @cindex future work, making Elisp function calls faster
18851 @cindex making Elisp function calls faster, future work 21027 @cindex making Elisp function calls faster, future work
21028
21029 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
18852 21030
18853 @strong{Abstract: }This page describes many optimizations that can be 21031 @strong{Abstract: }This page describes many optimizations that can be
18854 made to the existing Elisp function call mechanism without too much 21032 made to the existing Elisp function call mechanism without too much
18855 effort. The most important optimizations can probably be implemented 21033 effort. The most important optimizations can probably be implemented
18856 with only a day or two of work. I think it's important to do this work 21034 with only a day or two of work. I think it's important to do this work
18949 21127
18950 Calling @code{Fset()} to change the variable's value. 21128 Calling @code{Fset()} to change the variable's value.
18951 21129
18952 @end enumerate 21130 @end enumerate
18953 21131
18954
18955 @end enumerate 21132 @end enumerate
18956
18957
18958 21133
18959 The entire series of calls to @code{specbind()} should be inline and 21134 The entire series of calls to @code{specbind()} should be inline and
18960 merged into the argument processing code as a single tight loop, with no 21135 merged into the argument processing code as a single tight loop, with no
18961 function calls in the vast majority of cases. The @code{specbind()} 21136 function calls in the vast majority of cases. The @code{specbind()}
18962 logic should be streamlined as follows: 21137 logic should be streamlined as follows:
18996 issue here is with symbols whose names begin with a colon. These 21171 issue here is with symbols whose names begin with a colon. These
18997 symbols should simply be disallowed completely as parameter names.) 21172 symbols should simply be disallowed completely as parameter names.)
18998 21173
18999 @end enumerate 21174 @end enumerate
19000 21175
19001
19002 @end enumerate 21176 @end enumerate
19003
19004
19005 21177
19006 Other optimizations that could be done are: 21178 Other optimizations that could be done are:
19007 21179
19008 @itemize 21180 @itemize
19009 @item 21181 @item
19083 true and is false. (Note: the optimization detailed in this item is 21255 true and is false. (Note: the optimization detailed in this item is
19084 probably not worth doing on the first pass.) 21256 probably not worth doing on the first pass.)
19085 21257
19086 @end itemize 21258 @end itemize
19087 21259
19088 @uref{../../www.666.com/ben/default.htm,Ben Wing}
19089
19090 @node Future Work -- Lisp Engine Replacement, , Future Work -- Making Elisp Function Calls Faster, Future Work 21260 @node Future Work -- Lisp Engine Replacement, , Future Work -- Making Elisp Function Calls Faster, Future Work
19091 @section Future Work -- Lisp Engine Replacement 21261 @section Future Work -- Lisp Engine Replacement
19092 @cindex future work, lisp engine replacement 21262 @cindex future work, lisp engine replacement
19093 @cindex lisp engine replacement, future work 21263 @cindex lisp engine replacement, future work
19094 21264
19095 @menu 21265 @menu
19096 * Future Work -- Lisp Engine Discussion:: 21266 * Future Work -- Lisp Engine Discussion::
19097 * Future Work -- Lisp Engine Replacement -- Implementation:: 21267 * Future Work -- Lisp Engine Replacement -- Implementation::
21268 * Future Work -- Startup File Modification by Packages::
19098 @end menu 21269 @end menu
19099 21270
19100 @node Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Lisp Engine Replacement, Future Work -- Lisp Engine Replacement 21271 @node Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Lisp Engine Replacement, Future Work -- Lisp Engine Replacement
19101 @subsection Future Work -- Lisp Engine Discussion 21272 @subsection Future Work -- Lisp Engine Discussion
19102 @cindex future work, lisp engine discussion 21273 @cindex future work, lisp engine discussion
19103 @cindex lisp engine discussion, future work 21274 @cindex lisp engine discussion, future work
19104 21275
21276 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
19105 21277
19106 @strong{Abstract: }Recently there has been a great deal of talk on the 21278 @strong{Abstract: }Recently there has been a great deal of talk on the
19107 XEmacs mailing lists about potential changes to the XEmacs Lisp engine. 21279 XEmacs mailing lists about potential changes to the XEmacs Lisp engine.
19108 Usually the discussion has centered around the question which is better, 21280 Usually the discussion has centered around the question which is better,
19109 Common Lisp or Scheme? This is certainly an interesting debate topic, 21281 Common Lisp or Scheme? This is certainly an interesting debate topic,
19223 to make this safe would be to do conservative garbage collection over 21395 to make this safe would be to do conservative garbage collection over
19224 the C stack and to eliminate the GCPRO declarations entirely. But how 21396 the C stack and to eliminate the GCPRO declarations entirely. But how
19225 many of the Lisp engines that are being considered have such a mechanism 21397 many of the Lisp engines that are being considered have such a mechanism
19226 built into them? 21398 built into them?
19227 21399
19228
19229 @subsubheading Maintainability. 21400 @subsubheading Maintainability.
19230 21401
19231 A new Lisp engine might well improve the maintainability of XEmacs by 21402 A new Lisp engine might well improve the maintainability of XEmacs by
19232 offloading the maintenance of the Lisp engine. However, we need to make 21403 offloading the maintenance of the Lisp engine. However, we need to make
19233 very sure that this is, in fact, the case before embarking on a project 21404 very sure that this is, in fact, the case before embarking on a project
19282 naturally in an object-oriented system. However, neither Scheme nor 21453 naturally in an object-oriented system. However, neither Scheme nor
19283 Common Lisp has been designed with object orientation in mind. There is 21454 Common Lisp has been designed with object orientation in mind. There is
19284 a standard object system for Common Lisp, but it is extremely complex 21455 a standard object system for Common Lisp, but it is extremely complex
19285 and difficult to understand. 21456 and difficult to understand.
19286 21457
19287 21458 @node Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Startup File Modification by Packages, Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement
19288 @uref{../../www.666.com/ben/default.htm,Ben Wing}
19289
19290
19291 @node Future Work -- Lisp Engine Replacement -- Implementation, , Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement
19292 @subsection Future Work -- Lisp Engine Replacement -- Implementation 21459 @subsection Future Work -- Lisp Engine Replacement -- Implementation
19293 @cindex future work, lisp engine replacement, implementation 21460 @cindex future work, lisp engine replacement, implementation
19294 @cindex lisp engine replacement, implementation, future work 21461 @cindex lisp engine replacement, implementation, future work
21462
21463 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
19295 21464
19296 Let's take a look at the sort of work that would be required if we were 21465 Let's take a look at the sort of work that would be required if we were
19297 to replace the existing Elisp engine in XEmacs with some other engine, 21466 to replace the existing Elisp engine in XEmacs with some other engine,
19298 for example, the Clisp engine. I'm assuming here, of course, that we 21467 for example, the Clisp engine. I'm assuming here, of course, that we
19299 are not going to be changing the interface here at the same time, which 21468 are not going to be changing the interface here at the same time, which
19431 something special needs to happen when this is done. This could be 21600 something special needs to happen when this is done. This could be
19432 handled fairly easily by having our new and improved @code{DEFUN} macro 21601 handled fairly easily by having our new and improved @code{DEFUN} macro
19433 define a new macro for use when calling a primitive. 21602 define a new macro for use when calling a primitive.
19434 @end enumerate 21603 @end enumerate
19435 21604
19436
19437 @subsubheading Make the Existing Lisp Engine be Self-contained. 21605 @subsubheading Make the Existing Lisp Engine be Self-contained.
19438 21606
19439 The goal of this stage is to gradually build up a self-contained Lisp 21607 The goal of this stage is to gradually build up a self-contained Lisp
19440 engine out of the existing XEmacs core, which has no dependencies on any 21608 engine out of the existing XEmacs core, which has no dependencies on any
19441 of the code elsewhere in the XEmacs core, and has a well-defined and 21609 of the code elsewhere in the XEmacs core, and has a well-defined and
19639 again on the old and buggy interfaced Lisp engine, it would note the 21807 again on the old and buggy interfaced Lisp engine, it would note the
19640 bug. 21808 bug.
19641 21809
19642 @end enumerate 21810 @end enumerate
19643 21811
19644 21812 @node Future Work -- Startup File Modification by Packages, , Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Lisp Engine Replacement
19645 @uref{../../www.666.com/ben/default.htm,Ben Wing} 21813 @subsection Future Work -- Startup File Modification by Packages
21814 @cindex future work, startup file modification by packages
21815 @cindex startup file modification by packages, future work
21816
21817 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
21818
21819 OK, we need to create a design document for all of this, including:
21820
21821 PRINCIPLE #1: Whenever you have auto-generated stuff, @strong{CLEARLY}
21822 indicate this in comments around the stuff. These comments get
21823 searched for, and used to locate the existing generated stuff to
21824 replace. Custom currently doesn't do this.
21825
21826 PRINCIPLE #2: Currently, lots of functions want to add code to the
21827 .emacs. (e.g. I get prompted for my mail address from
21828 add-change-log-entry, and then prompted if I want to make this
21829 permanent). There needs to be a Lisp API for working with arbitrary
21830 code to be added to a user's startup. This API hides all the details
21831 of which file to put the fragment in, where in it, how to mark it with
21832 magical comments of the right kind so that previous fragments can be
21833 replaced, etc.
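A minimal sketch of the fragment-replacement step such an API would need, written in C over an in-memory copy of the init file. The marker strings and the name @code{sketch_set_fragment()} are assumptions made for illustration; the text above only asks for "magical comments of the right kind". When no earlier fragment exists, the sketch puts the new one at the front of the file, so that user-written code loaded afterwards can still override it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical marker format, not an existing XEmacs convention. */
#define SKETCH_BEGIN ";;; BEGIN GENERATED: "
#define SKETCH_END   ";;; END GENERATED: "

/* Copy INIT into OUT (capacity OUTSZ), replacing the fragment owned by TAG
   with BODY, or inserting it at the front if no such fragment exists.
   BODY is expected not to end in a newline; one is added.
   Returns 0 on success, -1 if OUT is too small, TAG is too long, or a
   begin marker has no matching end marker. */
int sketch_set_fragment(const char *init, const char *tag, const char *body,
                        char *out, size_t outsz)
{
    char begin[128], end[128];
    const char *start, *tail;
    int n;

    assert(init && tag && body && out);

    n = snprintf(begin, sizeof begin, "%s%s\n", SKETCH_BEGIN, tag);
    if (n < 0 || (size_t) n >= sizeof begin)
        return -1;
    n = snprintf(end, sizeof end, "%s%s\n", SKETCH_END, tag);
    if (n < 0 || (size_t) n >= sizeof end)
        return -1;

    start = strstr(init, begin);
    if (start != NULL) {
        /* An old fragment exists: everything up to its end marker is dropped. */
        const char *stop = strstr(start, end);
        if (stop == NULL)
            return -1;
        tail = stop + strlen(end);
    } else {
        /* No old fragment: insert before all existing text. */
        start = init;
        tail = init;
    }

    n = snprintf(out, outsz, "%.*s%s%s\n%s%s",
                 (int) (start - init), init, begin, body, end, tail);
    return (n < 0 || (size_t) n >= outsz) ? -1 : 0;
}
```

The search is a plain substring match, so a real implementation would also want to anchor markers at the start of a line and decide which file the fragment belongs in.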
21834
21835 PRINCIPLE #3: @strong{ALL} generated stuff should be loaded before any
21836 user-written init stuff. This way the user can override the generated
21837 settings. Although in the case of customize, it may work when the
21838 custom stuff is at the end of the init file, it surely won't work for
21839 arbitrary code fragments (which typically do @code{setq} or the like).
21840
21841 PRINCIPLE #4: As much as possible, generated stuff should be placed in
21842 separate files from non-generated stuff. Otherwise it's inevitable
21843 that some corruption is going to result.
21844
21845 PRINCIPLE #5: Packages are encouraged, as much as possible, to work
21846 within the customize model and store all their customizations there.
21847 However, if they really need to have their own init files, these files
21848 should be placed in .xemacs/, given normal names
21849 (e.g. @file{saved-abbrevs.el} not .abbrevs), and there should be some magic
21850 comment at the top of the file that causes it to get automatically
21851 loaded while loading a user's init file. (Alternatively, the
21852 above-named API could specify a function that lets a package specify
21853 that they want such-and-such file loaded from the init file, and have
21854 the specifics of this get handled correctly.)
21855
21856 OVERARCHING GOAL: The overarching goal is to provide a unified
21857 mechanism for packages to store state and setting information about
21858 the user and what they were doing when XEmacs exited, so that the same
21859 or a similar environment can be automatically set up the next time.
21860 In general, we are working more and more towards being a truly GUI app
21861 where users' settings are easy to change and get remembered correctly
21862 and consistently from one session to the next, rather than requiring
21863 nasty hacking in elisp.
21864
21865 Hrvoje, do you have any interest in this? How about you, Martin?
21866 This seems like it might be up your alley. This stuff has been
21867 ad-hocked since kingdom come, and it's high time that we make this
21868 work properly so that it could be relied upon, and a lot of things
21869 could "just work".
19646 21870
19647 @node Future Work Discussion, Old Future Work, Future Work, Top 21871 @node Future Work Discussion, Old Future Work, Future Work, Top
19648 @chapter Future Work Discussion 21872 @chapter Future Work Discussion
19649 @cindex future work, discussion 21873 @cindex future work, discussion
19650 @cindex discussion, future work 21874 @cindex discussion, future work
19655 into the normal Future Work section. 21879 into the normal Future Work section.
19656 21880
19657 @menu 21881 @menu
19658 * Discussion -- garbage collection:: 21882 * Discussion -- garbage collection::
19659 * Discussion -- glyphs:: 21883 * Discussion -- glyphs::
21884 * Discussion -- Dialog Boxes::
21885 * Discussion -- Multilingual Issues::
21886 * Discussion -- Windows External Widget::
21887 * Discussion -- Packages::
21888 * Discussion -- Distribution Layout::
19660 @end menu 21889 @end menu
19661 21890
19662 @node Discussion -- garbage collection, Discussion -- glyphs, Future Work Discussion, Future Work Discussion 21891 @node Discussion -- garbage collection, Discussion -- glyphs, Future Work Discussion, Future Work Discussion
19663 @section Discussion -- garbage collection 21892 @section Discussion -- garbage collection
19664 @cindex discussion, garbage collection 21893 @cindex discussion, garbage collection
19665 @cindex garbage collection, discussion 21894 @cindex garbage collection, discussion
19666 21895
19667
19668 @example
19669 On Tue, Oct 12, 1999 at 03:36:59AM -0700, Ben Wing wrote: 21896 On Tue, Oct 12, 1999 at 03:36:59AM -0700, Ben Wing wrote:
19670 @end example
19671 21897
19672 So what am I missing here? 21898 So what am I missing here?
19673 21899
19674 @example
19675 In response, Olivier Galibert wrote: 21900 In response, Olivier Galibert wrote:
19676 @end example
19677 21901
19678 Two things: 21902 Two things:
19679 @enumerate 21903 @enumerate
19680 @item 21904 @item
19681 The purespace is gone 21905 The purespace is gone
19710 was used. 21934 was used.
19711 @item 21935 @item
19712 move the markbit outside of the lrecord. 21936 move the markbit outside of the lrecord.
19713 @end itemize 21937 @end itemize
19714 21938
19715
19716 The second solution is more appealing to me for a bunch of reasons: 21939 The second solution is more appealing to me for a bunch of reasons:
19717 @itemize @bullet 21940 @itemize @bullet
19718 @item 21941 @item
19719 more things are shared than only what is purecopied (not yet used 21942 more things are shared than only what is purecopied (not yet used
19720 functions come to mind) 21943 functions come to mind)
19743 So no, it's not a _necessity_. But it helps. And the automatic 21966 So no, it's not a _necessity_. But it helps. And the automatic
19744 sharing of all objects until you write to them explicitly is, I 21967 sharing of all objects until you write to them explicitly is, I
19745 think, really cool. 21968 think, really cool.
19746 @end enumerate 21969 @end enumerate
19747 21970
19748
19749 @example
19750 On 10/12/1999 5:49 PM Ben Wing wrote: 21971 On 10/12/1999 5:49 PM Ben Wing wrote:
19751 21972
19752 Subject: Re: hashtable-based marking and cleanups 21973 Subject: Re: hashtable-based marking and cleanups
19753 @end example
19754 21974
19755 OK, I can see the advantages. But: 21975 OK, I can see the advantages. But:
19756 21976
19757 @enumerate 21977 @enumerate
19758 @item 21978 @item
19793 22013
19794 @example 22014 @example
19795 http://www.amazon.com/exec/obidos/ASIN/0471941484/qid=939775572/sr=1-1/002-3092633-2509405 22015 http://www.amazon.com/exec/obidos/ASIN/0471941484/qid=939775572/sr=1-1/002-3092633-2509405
19796 @end example 22016 @end example
19797 22017
19798 @node Discussion -- glyphs, , Discussion -- garbage collection, Future Work Discussion 22018 @node Discussion -- glyphs, Discussion -- Dialog Boxes, Discussion -- garbage collection, Future Work Discussion
19799 @section Discussion -- glyphs 22019 @section Discussion -- glyphs
19800 @cindex discussion, glyphs 22020 @cindex discussion, glyphs
19801 @cindex glyphs, discussion 22021 @cindex glyphs, discussion
19802 22022
19803 Some comments (not always pretty!) by Ben: 22023 Some comments (not always pretty!) by Ben:
19804 22024
19805 @example
19806 March 20, 2000 22025 March 20, 2000
19807 22026
19808 Andy, I use the tab widgets but I've been having lots of problems. 22027 Andy, I use the tab widgets but I've been having lots of problems.
19809 22028
19810 1] Sometimes clicking on them does nothing. 22029 1] Sometimes clicking on them does nothing.
19815 to the front of the buffer list, like it should. It looks like you're 22034 to the front of the buffer list, like it should. It looks like you're
19816 doing this to avoid having the order of the tabs change, but this is 22035 doing this to avoid having the order of the tabs change, but this is
19817 wrong: If you don't reorder the buffer list, everything else gets 22036 wrong: If you don't reorder the buffer list, everything else gets
19818 screwed up. If you want the order of the tabs not to change, you need 22037 screwed up. If you want the order of the tabs not to change, you need
19819 to decouple this order from the buffer list order. 22038 to decouple this order from the buffer list order.
19820 @end example 22039
19821
19822 @example
19823 March 23, 2000 22040 March 23, 2000
19824 22041
19825 I'm very confused. The SIGIO timer is used @strong{only} for C-g. It has 22042 I'm very confused. The SIGIO timer is used @strong{only} for C-g. It has
19826 nothing to do with any other events. (sit-for 0) ought to 22043 nothing to do with any other events. (sit-for 0) ought to
19827 22044
19837 leery of introducing new Lisp functions to deal with specific problems. 22054 leery of introducing new Lisp functions to deal with specific problems.
19838 Pretty soon we end up with a whole bevy of such ill-defined functions, 22055 Pretty soon we end up with a whole bevy of such ill-defined functions,
19839 like we already have. I think instead, you should introduce the 22056 like we already have. I think instead, you should introduce the
19840 following primitive: 22057 following primitive:
19841 22058
22059 @example
19842 (wait-for-event redisplay &rest event-specs) 22060 (wait-for-event redisplay &rest event-specs)
22061 @end example
19843 22062
19844 Waits for one of the event specifications specified to happen. Returns 22063 Waits for one of the event specifications specified to happen. Returns
19845 something about what happened. 22064 something about what happened.
19846 22065
19847 REDISPLAY controls the behavior of redisplay during waiting. Something 22066 REDISPLAY controls the behavior of redisplay during waiting. Something
19848 like 22067 like
19849 22068
19850 - nil (never redisplay), 22069 @itemize @bullet
19851 - t (redisplay when it seems appropriate), etc. 22070 @item
22071 nil (never redisplay),
22072 @item
22073 t (redisplay when it seems appropriate), etc.
22074 @end itemize
19852 22075
19853 EVENT-SPECS could be 22076 EVENT-SPECS could be
19854 22077
22078 @example
19855 t -- drain all non-user events, and then return 22079 t -- drain all non-user events, and then return
19856 any-process -- wait till input or state change on any process 22080 any-process -- wait till input or state change on any process
19857 process -- wait till input or state change on process 22081 process -- wait till input or state change on process
19858 time -- wait till such-and-such time has elapsed 22082 time -- wait till such-and-such time has elapsed
19859 'user -- wait till user event has happened 22083 'user -- wait till user event has happened
19860 '(user predicate) -- wait till user event matching the predicate has 22084 '(user predicate) -- wait till user event matching the predicate has
19861 happened 22085 happened
19862 'event -- wait till any event has happened 22086 'event -- wait till any event has happened
19863 '(event predicate) -- wait till event matching the predicate has happened 22087 '(event predicate) -- wait till event matching the predicate has happened
22088 @end example
19864 22089
19865 The existing functions @code{next-event}, @code{next-command-event}, 22090 The existing functions @code{next-event}, @code{next-command-event},
19866 @code{accept-process-output}, @code{sit-for}, @code{sleep-for}, etc. could all be 22091 @code{accept-process-output}, @code{sit-for}, @code{sleep-for}, etc. could all be
19867 written in terms of this new command. You could use this command inside 22092 written in terms of this new command. You could use this command inside
19868 of your glyph code to ensure that the events get processed that need to be 22093 of your glyph code to ensure that the events get processed that need to be
19869 in order for widget updates to happen. 22094 in order for widget updates to happen.
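One way to picture the matching step behind the proposed primitive is the C sketch below. All type and function names are hypothetical; the sketch covers the @code{any-process}, @code{process}, @code{time}, @code{'user} and @code{'event} specs (with an optional predicate for the last two) and leaves out the @code{t} "drain" spec and the REDISPLAY argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced event representation. */
enum sketch_event_kind { SKETCH_USER, SKETCH_PROCESS, SKETCH_TIMEOUT };

struct sketch_event {
    enum sketch_event_kind kind;
    int process_id;              /* meaningful for SKETCH_PROCESS only */
};

/* One entry of EVENT-SPECS. */
enum sketch_spec_kind {
    SPEC_ANY_PROCESS, SPEC_PROCESS, SPEC_TIME, SPEC_USER, SPEC_EVENT
};

struct sketch_spec {
    enum sketch_spec_kind kind;
    int process_id;                                 /* SPEC_PROCESS only */
    int (*predicate)(const struct sketch_event *);  /* optional; may be NULL */
};

/* Nonzero if EV satisfies SPEC. */
static int sketch_spec_matches(const struct sketch_spec *spec,
                               const struct sketch_event *ev)
{
    switch (spec->kind) {
    case SPEC_ANY_PROCESS:
        return ev->kind == SKETCH_PROCESS;
    case SPEC_PROCESS:
        return ev->kind == SKETCH_PROCESS
            && ev->process_id == spec->process_id;
    case SPEC_TIME:
        return ev->kind == SKETCH_TIMEOUT;
    case SPEC_USER:
        return ev->kind == SKETCH_USER
            && (spec->predicate == NULL || spec->predicate(ev));
    case SPEC_EVENT:
        return spec->predicate == NULL || spec->predicate(ev);
    }
    return 0;
}

/* Index of the first spec satisfied by EV, or -1 if the wait must go on.
   The index is one possible form of "something about what happened". */
int sketch_first_match(const struct sketch_spec *specs, size_t nspecs,
                       const struct sketch_event *ev)
{
    size_t i;

    assert(specs != NULL && ev != NULL);
    for (i = 0; i < nspecs; i++)
        if (sketch_spec_matches(&specs[i], ev))
            return (int) i;
    return -1;
}
```

Under this model, a function like @code{sit-for} would amount to waiting on a time spec plus a user spec, and @code{accept-process-output} to waiting on a process spec.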
19870 22095
19871 But you said something about needing a magic event to invoke redisplay? 22096 But you said something about needing a magic event to invoke redisplay?
19872 Why is that? 22097 Why is that?
19873 @end example 22098
19874
19875 @example
19876 April 2, 2000 22099 April 2, 2000
19877 22100
19878 the internal distinction between "widget" and "layout" is bogus. there 22101 the internal distinction between "widget" and "layout" is bogus. there
19879 exist widgets that do drawing and do layout of their children, 22102 exist widgets that do drawing and do layout of their children,
19880 e.g. group-box widgets and proper tab widgets. the only sensible 22103 e.g. group-box widgets and proper tab widgets. the only sensible
19881 distinction is between widgets with children and those without children. 22104 distinction is between widgets with children and those without children.
19882 @end example 22105
19883
19884 @example
19885 April 5, 2000 22106 April 5, 2000
19886 22107
19887 andy, i'm not sure i really believe that you need to cycle the event 22108 andy, i'm not sure i really believe that you need to cycle the event
19888 code to get widgets to redisplay, but in any case you should 22109 code to get widgets to redisplay, but in any case you should
19889 22110
19898 @end enumerate 22119 @end enumerate
19899 22120
19900 in other words, dispatch-non-command-events must go, and i am proposing 22121 in other words, dispatch-non-command-events must go, and i am proposing
19901 a general function (redisplay OBJECT) to replace the existing ad-hoc 22122 a general function (redisplay OBJECT) to replace the existing ad-hoc
19902 functions. 22123 functions.
19903 @end example 22124
19904
19905 @example
19906 April 6, 2000 22125 April 6, 2000
19907 22126
19908 the tab widget code should simply be able to create a whole lot of tabs 22127 the tab widget code should simply be able to create a whole lot of tabs
19909 without regard to the size of the gutter, and the surrounding layout 22128 without regard to the size of the gutter, and the surrounding layout
19910 widget (please please make layouts be proper widgets!) should 22129 widget (please please make layouts be proper widgets!) should
19911 automatically map and unmap them as necessary, to fill up the available 22130 automatically map and unmap them as necessary, to fill up the available
19912 space. perhaps this already works and what you're doing is just for 22131 space. perhaps this already works and what you're doing is just for
19913 optimization? but i get the feeling this is not the case. 22132 optimization? but i get the feeling this is not the case.
19914 @end example 22133
19915
19916 @example
19917 April 6, 2000 22134 April 6, 2000
19918 22135
19919 the function make-gutter-only-dialog-frame is bogus. the use of the 22136 the function make-gutter-only-dialog-frame is bogus. the use of the
19920 gutter here to hold widgets is an implementation detail and should not 22137 gutter here to hold widgets is an implementation detail and should not
19921 be exposed in the interface. similarly, make-search-dialog should not 22138 be exposed in the interface. similarly, make-search-dialog should not
19924 hidden. you should have a simple function make-dialog-frame that takes 22141 hidden. you should have a simple function make-dialog-frame that takes
19925 a dialog specification, and that's all you need to do. 22142 a dialog specification, and that's all you need to do.
19926 22143
19927 also, these dialog boxes, and this function make-dialog-frame, should 22144 also, these dialog boxes, and this function make-dialog-frame, should
19928 22145
19929 a] be in dialog.el, not gutter-items.el. 22146 @enumerate
19930 b] when possible, be placed in the interactive spec of standard lisp 22147 @item
19931 functions rather than accessed directly from menubar-items.el 22148 be in @file{dialog.el}, not gutter-items.el.
19932 c] wrapped in calls to should-use-dialog-box-p, so the user has control 22149 @item
22150 when possible, be placed in the interactive spec of standard lisp
22151 functions rather than accessed directly from @file{menubar-items.el}
22152 @item
22153 wrapped in calls to should-use-dialog-box-p, so the user has control
19933 over when dialog boxes appear. 22154 over when dialog boxes appear.
19934 @end example 22155 @end enumerate
19935 22156
19936 @example
19937 April 7, 2000 22157 April 7, 2000
19938 22158
19939 hmmm ... in that case, the whitespace absolutely needs to be specified 22159 hmmm ... in that case, the whitespace absolutely needs to be specified
19940 as properties of the layout widget (e.g. :border-width and 22160 as properties of the layout widget (e.g. :border-width and
19941 :border-height), rather than setting an overall size. you have no idea 22161 :border-height), rather than setting an overall size. you have no idea
19942 what the correct size should be if the user changes font size or uses 22162 what the correct size should be if the user changes font size or uses
19943 translations in a different language. 22163 translations in a different language.
19944 22164
19945 Your modus operandi should be "hardcoded pixel sizes are @strong{always} bad." 22165 Your modus operandi should be "hardcoded pixel sizes are @strong{always} bad."
19946 @end example 22166
19947
19948 @example
19949 April 7, 2000 22167 April 7, 2000
19950 22168
19951 you mean the number of tabs adjusts, or the size of each tab adjusts (by 22169 you mean the number of tabs adjusts, or the size of each tab adjusts (by
19952 making the font smaller or something)? if the size of a single tab is 22170 making the font smaller or something)? if the size of a single tab is
19953 not related to the total space the tabs can fit into, then it should be 22171 not related to the total space the tabs can fit into, then it should be
19964 a maximum width (which should be done in 'n' sizes, not in pixels!). 22182 a maximum width (which should be done in 'n' sizes, not in pixels!).
19965 22183
19966 i won't stop complaining until i see nearly every one of those 22184 i won't stop complaining until i see nearly every one of those
19967 pixel-width and pixel-height parameters gone, and the remaining ones 22185 pixel-width and pixel-height parameters gone, and the remaining ones
19968 there for a very, very good reason. 22186 there for a very, very good reason.
19969 @end example 22187
22188 April 7, 2000
22189
22190 Andy Piper wrote:
19970 22191
19971 @example 22192 @example
19972 April 7, 2000
19973
19974 Andy Piper wrote:
19975
19976 > At 03:51 PM 4/6/00 -0700, Ben Wing wrote: 22193 > At 03:51 PM 4/6/00 -0700, Ben Wing wrote:
19977 > >[the function make-gutter-only-dialog-frame is bogus] 22194 > >[the function make-gutter-only-dialog-frame is bogus]
19978 > 22195 >
19979 > The problem is that some of the callbacks and such need access to the 22196 > The problem is that some of the callbacks and such need access to the
19980 > @strong{created} frame, so you end up in a catch 22 unless you do what I've done. 22197 > @strong{created} frame, so you end up in a catch 22 unless you do what I've done.
22198 @end example
19981 22199
19982 [Ben proposes other ways to avoid exposing all the guts, as in 22200 [Ben proposes other ways to avoid exposing all the guts, as in
19983 @code{make-gutter-only-dialog-frame}:] 22201 @code{make-gutter-only-dialog-frame}:]
19984 22202
19985 @enumerate 22203 @enumerate
19998 (depending on where the glyph is) where the invocation actually 22216 (depending on where the glyph is) where the invocation actually
19999 happened. That way, the callbacks can easily figure out the dialog 22217 happened. That way, the callbacks can easily figure out the dialog
20000 box and its parent, and not have to worry about embedding it in at 22218 box and its parent, and not have to worry about embedding it in at
20001 creation time. 22219 creation time.
20002 @end enumerate 22220 @end enumerate
20003 @end example 22221
20004
20005 @example
20006 April 15, 2000 22222 April 15, 2000
20007 I don't understand when you say "the various types of callback". Are 22223 I don't understand when you say "the various types of callback". Are
20008 you using the callback for various different purposes? 22224 you using the callback for various different purposes?
20009 22225
20010 Your widget callbacks should work just like any other callback: they 22226 Your widget callbacks should work just like any other callback: they
20011 take two arguments, one indicating the object to which the callback was 22227 take two arguments, one indicating the object to which the callback was
20012 attached (an image instance, i think), and the event that caused the 22228 attached (an image instance, i think), and the event that caused the
20013 callback to be invoked. 22229 callback to be invoked.
20014 @end example 22230
20015
20016 @example
20017 April 17, 2000 22231 April 17, 2000
20018 22232
20019 I am completely vetoing widget-callback-current-channel. How about you 22233 I am completely vetoing widget-callback-current-channel. How about you
20020 create a new keyword, :new-callback, that is a function of two args, 22234 create a new keyword, :new-callback, that is a function of two args,
20021 like i specified before. 22235 like i specified before.
20026 result as widget-callback-current-channel. 22240 result as widget-callback-current-channel.
20027 22241
20028 the problem with this and everything you've proposed is that there's no 22242 the problem with this and everything you've proposed is that there's no
20029 way, of course, to get at the actual widget that you were invoked from. 22243 way, of course, to get at the actual widget that you were invoked from.
20030 would you propose adding widget-callback-current-widget? 22244 would you propose adding widget-callback-current-widget?
22245
22246 @node Discussion -- Dialog Boxes, Discussion -- Multilingual Issues, Discussion -- glyphs, Future Work Discussion
22247 @section Discussion -- Dialog Boxes
22248 @cindex discussion, dialog boxes
22249 @cindex dialog boxes, discussion
22250
22251 @example
22252 From:
22253 Ben Wing <ben@@666.com>
22254 10/7/1999 5:57 PM
22255
22256 Subject:
22257 Re: Animated gif patch (2)
22258 To:
22259 Andy Piper <andy@@xemacs.org>
22260 CC:
22261 xemacs-review@@xemacs.org, xemacs-beta@@xemacs.org
22262
22263
22264
22265
22266 The distinction between layouts and widgets makes no sense, so you should combine
22267 the different data required. Consider a grouping widget. Is this a layout or a
22268 widget? It draws, like a widget, but has children, like a layout. Same for a tab
22269 widget, properly implemented. It draws, handles input, has children, and makes
22270 choices about how to lay them out.
22271
22272 ben
22273
22274 From:
22275 Ben Wing <ben@@666.com>
22276 9/7/1999 8:50 PM
22277
22278 Subject:
22279 Re: Layouts done
22280 To:
22281 Andy Piper <andyp@@beasys.com>
22282
22283
22284
22285
22286 this sounds great! where can i see the code?
22287
22288 as for user-defined layouts, you must certainly have some sort of abstraction
22289 layer for layouts, with DEFINE_LAYOUT_TYPE or something similar just like device
22290 types and such. If not, you should certainly make one ... it would have methods
22291 such as query-geometry and do-layout. It should be easy to create a user-defined
22292 layout if you have such an abstraction.
22293
22294 with a user-defined layout, complex built-in layouts such as grid should not be
22295 necessary because it's so easy to write snippets of lisp.
22296
22297 as for the "redisplay too much" problem, perhaps you could put a dirty flag in
22298 each glyph indicating whether it needs to be redisplayed, recalculated, etc.?
22299
22300 Andy Piper wrote:
22301
22302 > You may want to check them out. I haven't done the user-defined layout
22303 > callback - I'm not sure what sort of API this could have. Keywords I've done:
22304 >
22305 > :orientation - vertical or horizontal
22306 > :justify - left, center or right
22307 > :border - etch-in, etch-out, bevel-in, bevel-out or text (which gives you
22308 > etch-in with a title)
22309 >
22310 > You can embed any glyph type in a layout.
22311 >
22312 > There is probably room for improvements for justify to do grid-type layouts
22313 > as per java.
22314 >
22315 > The only annoying thing is that I've hacked up font-lock support to do a
22316 > progress gauge in the gutter area. I've used a layout to set things out
22317 > correctly. The problem is if you change one of the sub-widgets, the whole
22318 > layout gets redisplayed because it is treated as a single glyph by redisplay.
22319 >
22320 > Oh, and I've done line based scrolling so that glyphs scroll off the page
22321 > in units of the average display line height rather than the whole line at
22322 > once. This could easily be converted to pixel scrolling but would be very
22323 > slow I fear.
22324 >
22325 > andy
22326 > --------------------------------------------------------------
22327 > Dr Andy Piper
22328 > Senior Consultant Architect, BEA Systems Ltd
22329
22330
22331
22332
22333 From:
22334 Ben Wing <ben@@666.com>
22335 8/10/1999 11:11 PM
22336
22337 Subject:
22338 Re: Widgets
22339 To:
22340 Andy Piper <andy@@xemacs.org>
22341
22342
22343
22344
22345 I think you might have misinterpreted what i meant. I meant to say that XEmacs should
22346 implement the @strong{concept} of a hierarchy of nested child "widgets" or "gui items" or
22347 whatever we want to call them -- this includes container "widgets" such as grouping
22348 widgets (which draw a border around the children, like in Windows), tab widgets, simple
22349 layout widgets (invisible, but lay out their children appropriately), etc, plus leaf
22350 "widgets" (buttons, sliders, etc., also standard Emacs windows). The layout calculations
22351 for these widgets would be handled entirely by XEmacs in a window-system-independent way.
22352 There is no need to create a corresponding hierarchy of window-system
22353 widgets/controls/whatever if it's not required, and certainly no need to try to use the
22354 window-system-supplied geometry management routines. It's absolutely necessary to support
22355 this nesting concept in XEmacs, however, or it's impossible to have easily-designable
22356 dialog boxes. On the other hand, I think it @strong{is} required to create much of this
22357 hierarchy within the actual window system, at the very least for non-invisible container
22358 widgets (tab, grouping, etc.), otherwise we will have very bogus, non-native-looking
22359 containers like your current tab-widget implementation. It's critical for XEmacs to be
22360 able to create dialog boxes in Windows or Motif that look just like those in any other
22361 standard application. Otherwise people will continue to think that XEmacs is a
22362 backwards-looking, badly implemented piece of software, which in many ways it is,
22363 particularly in regards to its user interface.
22364
22365 Perhaps we should talk on the phone? This typing is quite hard for me still. What hours
22366 are you at work? My hours are approx. 2pm - 2am Pacific time (GMT - 7 hours currently).
22367
22368 ben
22369
22370
22371 From:
22372 Ben Wing <ben@@666.com>
22373 7/21/1999 2:44 AM
22374
22375 Subject:
22376 Re: Tabs 'n widgets screenshot
22377 To:
22378 Andy Piper <andy@@xemacs.org>
22379 CC:
22380 xemacs-beta@@xemacs.org, wmperry@@aventail.com
22381
22382
22383
22384
22385 This is real cool, but looking at this, it's clear that it doesn't look the
22386 way tab widgets are supposed to work. In particular, of course, they should
22387 have the proper borders around the stuff displayed. I've attached a screen
22388 shot of a typical Windows dialog box with a tab widget in it. The problem
22389 lies with this "expanded gutter" concept. Tabs are @strong{NOT} extra graphical junk
22390 placed in the gutters of a buffer but are GUI objects with @strong{children} inside
22391 of them. This is the right way to do things, and you would need no extra
22392 gutter functionality at all for this. You just need to implement the concept
22393 of GUI objects containing other GUI objects within them. One such GUI object
22394 needs to be an "Emacs-text" GUI object, which is an Emacs window and contains a
22395 buffer within it. At this level, you need not be concerned with the
22396 complexities of geometry layout. The only change that needs to be made in the
22397 overall strategy of frames, windows, etc. is that windows need not be exactly
22398 contiguous and tiled, as long as they are contained within a frame. Or more
22399 specifically: Given that you could always split a window contained inside a
22400 GUI object, we just need to expand things so that each frame has @strong{multiple}
22401 hierarchies of windows in it, rather than just one. A hierarchy of windows
22402 can nest inside of another window -- e.g. I put a tab widget or a text widget
22403 inside of a buffer. This should be easy to implement -- just change things so
22404 there are multiple hierarchies of windows where there is currently one, each (except
22405 the top-level one) being rooted inside some other window.
22406
22407 Anyone willing to implement this? Andy?
22408
22409
22410 From:
22411 Ben Wing <ben@@666.com>
22412 6/30/1999 3:30 PM
22413
22414 Subject:
22415 Re: Focus Help!
22416 To:
22417 Andy Piper <andy@@xemacs.org>
22418 CC:
22419 Ben Wing <ben@@xemacs.org>, martin@@xemacs.org, andyp@@beasys.com
22420
22421
22422
22423
22424 It sounds like you're doing very good work. It also sounds like the approach
22425 you have followed is the correct one. Now, it seems like there isn't really
22426 that much work left to get dialog boxes working. What you really just need to
22427 do is implement container widgets, that is to say, subwindows that can contain
22428 other subwindows. For example, the tab widget works this way. (It sounds like
22429 you have already implemented tab widgets, so I don't quite see how you've done
22430 this without the concept of container widgets.) So you might just try adding a
22431 framework for container widgets and then implementing very simple container
22432 widgets. The basic container widgets are:
22433
22434 1. A vertical-layout widget, which draws nothing itself and lays out its
22435 children one above the next.
22436 2. A horizontal-layout widget, which draws nothing itself and lays out its
22437 children side-to-side.
22438 3. A box (or "grouping") widget, which draws a rectangle around its single child
22439 and optionally draws some text on the top or bottom line of the rectangle.
22440 4. A tab widget, which displays a series of tabs horizontally at the top of its
22441 area, and then below it places one of its children,
22442 corresponding to the selected tab.
22443 5. A user widget, which draws nothing itself and does no layout at all on its
22444 children, except that it has a "layout callback"
22445 property, a Lisp function, so that the programmer can control the layout.
22446
22447 The framework is as follows:
22448
22449 1. Every widget has at least the following properties:
22450 a) a size, whose value can be "unspecified", which might be implemented
22451 using the value -1. The default value should be "unspecified".
22452 b) whether it's mapped, i.e. whether it will be displayed. (Some container
22453 widgets, such as the tab widget, set the mapped
22454 property themselves on their children. Others, such as the vertical and
22455 horizontal layout widgets, don't change this property but pay attention to it,
22456 and ignore completely all children marked as unmapped.) The default value should
22457 be "true".
22458 c) whether its size can be changed by another widget's layout routine. The
22459 default value should be "true".
22460 d) a layout procedure, which (potentially at least) determines the size of
22461 the widget as well as the position, size and mappedness of its child widgets.
22462 The layout procedure is inherent in the widget and is not an external property
22463 of the widget (except in the case of the "user widget"): it is instead more like
22464 the redisplay callback that each widget has.
22465 2. Every container widget contains a property which is a list of child widgets.
22466 3. Every child widget contains the following properties:
22467 a) a position indicating where the child is located relative to the top
22468 left corner of its parent. The position's value can be "unspecified", which
22469 might be implemented using the value -1. The default value should be
22470 "unspecified".
22471 b) whether its position can be changed by another widget's layout routine.
22472 The default value should be "true".
22473 4. All of the properties just listed (except possibly the layout procedure) can
22474 be modified directly by the programmer, and there are no proscriptions against
22475 doing so. However, if the programmer wants to resize, reposition, map or unmap
22476 a widget in such a way that the layout of all the other widgets in the tree
22477 changes appropriately, he should use a special function to change the property,
22478 as described below.
22479
22480 The redisplay mechanism pays attention to the position, size, and mappedness
22481 properties and to the hierarchy of widgets, mapping, resizing and repositioning
22482 the corresponding subwindows (the "real representation" of the widgets) as
22483 necessary. It also pays attention to the hierarchy of the widgets, making sure
22484 that container subwindows get drawn before their child subwindows. When it
22485 encounters widgets with an unspecified size, it should not draw them, and should
22486 issue a warning. When it encounters widgets with an unspecified position, it
22487 should draw them at position (0, 0) and should issue a warning.
22488
22489 The above framework should be fairly simple to implement and is basically
22490 universal across all high-level windowing system toolkits. The stickyness comes
22491 with what procedures you follow for getting the layout done.
22492
22493 Andy, I understand that implementing this may seem like a daunting task.
22494 Therefore, I propose that at first you implement the above framework but don't
22495 implement any of the layout procedures, or any of the functions that call them:
22496 Just make them stubs that do nothing. This way, the Lisp programmer can still
22497 create any dialog boxes he wants, he just has to set the sizes and positions of
22498 all the widgets explicitly, and then recompute them whenever the widget tree is
22499 resized (once you get around to allowing this). I have a lot more to write
22500 about exactly how the layout procedures work, but I'll send that to you later
22501 once you're ready.
22502
22503 You should also think about making a way to have widget trees as top-level
22504 windows rather than just glyphs in a buffer. There's already the concept of
22505 "popup" frames. You could provide an easy way to create a popup frame with no
22506 menu, toolbars, scrollbars, modeline or minibuffer, and put a single glyph in
22507 the displayed buffer that takes up the whole Emacs window.
22508
22509 Ben
22510
22511
22512
22513
22514 March 20, 2000
22515
22516 You wrote to me awhile ago about this and asked about documentation, and I
22517 dictated a response but never got it sent, so here it is:
22518
22519 I don't think there's any more documentation on how things work under Xt but it
22520 should be clear. The EmacsFrame widget is the widget corresponding to the X
22521 window that Emacs draws into and there is a handler for expose events called
22522 from Xt which arranges for the invalidated areas to get redrawn. I think this
22523 used to happen as part of the handler itself but now it is delayed until the
22524 next call to redisplay.
22525
22526 However, one thing that you absolutely must not do is remove the Xt support.
22527 This would be an incredibly unfriendly thing to do as it would prevent people
22528 from using any widget set other than Qt or GTK. Keep in mind that people run
22529 XEmacs on all sorts of different versions of X in Unix, and Xt is the standard
22530 and the only toolkit that probably exists on all of these systems.
22531
22532 Pardon me if I've misunderstood your intentions w.r.t. this.
22533
22534 As for how you would implement GTK support, it will not be very hard to convert
22535 redisplay to draw into a GTK window instead of an Xt window. In fact redisplay
22536 basically doesn't know about Xt at all, except in the portion that handles
22537 updating menubars and scrollbars and stuff that's directly related to Xt.
22538
22539 What you'd probably want to do is create a new set of event routines to replace
22540 the ones in event-Xt.c. On the display side you could conceivably create a new
22541 device type but you probably wouldn't want to do that because it would be an
22542 externally visible change at the Lisp level. You might simply want to put a
22543 flag on each frame indicating what sort of toolkit the frame was created under
22544 and put conditions in the redisplay code and the code to update toolbars and
22545 menubars and so forth to test this flag and do the appropriate thing.
22546
22547
22548 April 12, 2000
22549
22550 This is way cool, buuuuutttttttt .............
22551
22552 what we @strong{really} need is the GUI interface on top of it. I've taken a shot at
22553 it with generic-print-buffer
22554 (print-buffer is taken by lpr, which is such a total mess that it needs to be
22555 trashed; or at least, the generic
22556 stuff in this package needs to be taken out and properly genericized). For
22557 the moment, generic-print-buffer
22558 just does something like what Kirill's been posting if we're running windows,
22559 and uses lpr otherwise. However, what we absofuckinglutely need is a Lisp
22560 interface onto @code{EnumPrinters()} so that we can get the
22561 list of printers and have a nice menu listing the available printers, and you
22562 can check the one you want. People in the Windows world don't normally even
22563 know the names of their local printers!
22564
22565 Kirill, given what I've done in @file{simple.el} and @file{menubar-items.el}, do you think
22566 you could add the @code{EnumPrinters()}
22567 support and fix up the GUI? If you don't feel comfortable with the GUI, at
22568 least do the @code{EnumPrinters()}.
22569
22570 But ... Kirill, I tried your formula for printing and nothing happened.
22571 Perhaps I didn't call redisplay-frame or something? You need to fix this up
22572 and make it work for multi-page documents. (Again, this is in
22573 generic-print-buffer.) Nothing special, it just needs to fucking work! There
22574 are zillions and zillions of postings every day on xemacs-nt about how to get
22575 printing working, and none seem to refer to the built-in support.
22576
22577 ben
22578
22579
22580 April 19, 2000
22581
22582 Kirill 'Big K' Katsnelson wrote:
22583
22584 > Some time ago, Ben Wing wrote...
22585 > >kirill, the interface i created is more general, like this:
22586 >
22587 > [snip]
22588 >
22589 > >Unfortunately I haven't implemented much of this; just some of the file
22590 > >dialog box. but i think
22591 > >this is better than creating new mswindows-specific primitives. if you
22592 > >are interested in working on
22593 > >this, i'll send you the code i have.
22594 >
22595 > Sure. Can you just commit it for my starting point?
22596 >
22597 > >also, the dialogs shouldn't have anything directly to do with the printer
22598 > >device. all they should
22599 > >do is return a set of values. it's the caller's responsibility to
22600 > >interpret them and set device
22601 > >properties accordingly. this way, there's a complete separation between
22602 > >the underlying
22603 > >functionality and the gui.
22604 >
22605 > Unfortunately. I thought about doing it this way, but we then lose a lot of
22606 > printer-specific setup in this case. The DEVMODE structure contains two
22607 > parts: printer independent, as defined by SDK typedef DEVMODE, and
22608 > some trailing bytes, of unknown structure, used by a driver. The driver
22609 > only returns the extra length it wants. Such options as PCL ReT resolution
22610 > enhancement options or PostScript negative output are not available
22611 > through the standard part of the devmode structure, and stored in the
22612 > driver part (printer dialogs are driver-specific).
22613 >
22614 > So we have total of three options:
22615 > - Not to implement options beyond standard DEVMODE
22616 > - Make DEVMODE a Lisp object.
22617 > - Hide DEVMODE inside the device object.
22618 >
22619 > First case looks cheesy. Letting DEVMODE fall off the printer is no good
22620 > either, since one needs both the device and the devmode to edit the
22621 > devmode, and they must match. I am still convinced that the devmode and
22622 > the printer should not be separated.
22623
22624 hmm, i see ... this completely breaks abstraction though. it fails in various
22625 scenarios, e.g. a program wants to initialize the dialog box with certain
22626 non-driver-specific properties, without caring about the particular printer.
22627
22628 i think you should create a new print-properties object that encapsulates all
22629 printer properties (which can be changed using get/put), including the printer
22630 name, and contains a DEVMODE in it. if the printer name gets changed, the
22631 DEVMODE might change too, but the print-properties object itself stays the
22632 same. you pass this object as a parameter to the dialog box, and it gets
22633 changed accordingly. you can call something like set-device-print-properties to
22634 stick everything in this structure into the device. (you could imagine a case
22635 where someone wanted to keep multiple print configurations around ...)
22636
22637 >
22638 >
22639 > Big K
22640
22641 --
22642 Ben
22643
22644 @end example
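
The container-widget framework outlined in the 6/30/1999 message in the
example above can be sketched roughly as follows. This is only an
illustration in Python, not XEmacs code: the names Widget, VerticalLayout
and do_layout are hypothetical, None stands in for the "unspecified" value
that the message suggests might be implemented as -1, and raising an error
for a child of unspecified size is an assumption of the sketch (the message
only says redisplay should warn and not draw such a widget).

```python
class Widget:
    """A widget carrying the properties listed in the message (names assumed)."""

    def __init__(self, width=None, height=None, children=None):
        # None plays the role of "unspecified" for size and position.
        self.width = width
        self.height = height
        self.x = None
        self.y = None
        self.mapped = True               # default is "true"
        self.size_changeable = True      # may a layout routine resize it?
        self.position_changeable = True  # may a layout routine move it?
        self.children = list(children or [])

    def do_layout(self):
        """Leaf widgets do no layout of their own."""


class VerticalLayout(Widget):
    """Draws nothing; stacks its mapped children one above the next."""

    def do_layout(self):
        y = 0
        widest = 0
        for child in self.children:
            if not child.mapped:
                continue  # unmapped children are ignored completely
            child.do_layout()
            if child.width is None or child.height is None:
                raise ValueError("child has unspecified size")
            if child.position_changeable:
                child.x, child.y = 0, y
            y += child.height
            widest = max(widest, child.width)
        if self.size_changeable:
            self.width, self.height = widest, y


# Usage: two visible children and one unmapped child.
button = Widget(width=80, height=20)
hidden = Widget(width=500, height=500)
hidden.mapped = False
label = Widget(width=120, height=16)
box = VerticalLayout(children=[button, hidden, label])
box.do_layout()
```

A horizontal layout would differ only in which axis accumulates, and the
"user widget" from the message would replace do_layout with a
caller-supplied function.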
22645
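
The print-properties object proposed in the April 19, 2000 reply in the
example above could look roughly like this. The sketch is in Python purely
for illustration: PrintProperties, its get and put methods, the "copies"
property and the lookup callback are all hypothetical, and the opaque
driver_data value merely stands in for a Windows DEVMODE.

```python
class PrintProperties:
    """Holds printer settings independently of any device (hypothetical)."""

    def __init__(self, printer_name, lookup_driver_data):
        # lookup_driver_data is an assumed callback returning the opaque,
        # driver-specific settings for a given printer name.
        self._lookup = lookup_driver_data
        self._props = dict(printer_name=printer_name, copies=1)
        self.driver_data = lookup_driver_data(printer_name)

    def get(self, name):
        return self._props[name]

    def put(self, name, value):
        self._props[name] = value
        if name == "printer_name":
            # The driver data may change with the printer, but the
            # properties object itself stays the same.
            self.driver_data = self._lookup(value)


def fake_lookup(printer_name):
    # Stand-in for asking the printer driver for its settings.
    return ("driver-data-for", printer_name)


props = PrintProperties("LaserJet", fake_lookup)
props.put("copies", 3)
props.put("printer_name", "Inkjet")
```

A dialog box would take such an object as a parameter and modify it, and a
separate call along the lines of set-device-print-properties would copy its
contents into the device, so several print configurations can be kept
around at once.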
22646 @node Discussion -- Multilingual Issues, Discussion -- Windows External Widget, Discussion -- Dialog Boxes, Future Work Discussion
22647 @section Discussion -- Multilingual Issues
22648 @cindex discussion, multilingual issues
22649 @cindex multilingual issues, discussion
22650
22651 @example
22652
22653 4/10/2000 4:13 AM
22654
22655 BTW I am planning on adding some more powerful font-mapping capabilities to
22656 XEmacs (i.e. how do we map particular characters to the proper fonts that can
22657 display them, and how do we map the character's codes to the indices into the
22658 font). These will replace the hackish charset-registry/charset-ccl-program stuff
22659 we currently have, and will [a] be much more powerful, [b] be designed in a
22660 window-system-independent way, [c] work with specifiers so you can control the
22661 mapping of individual buffers, and [d] work on a character rather than charset
22662 level, to correctly handle Unicode. One possible usage would be to declare that
22663 all latin1 in a particular buffer be displayed with latin2 fonts; I bet
22664 Hrvoje would really appreciate that.
22665
22666 ---------------------------------------------------------------------------
22667
22668 April 10, 2000
22669
22670 [info from "creation of generic macros for accessing internally formatted data"]
22671
22672 Hmm, so there I just wrote a detailed design for the macros. I would be
22673 @strong{THRILLED} and overjoyed if you went ahead and implemented this mechanism, or
22674 parts of it.
22675
22676 I've just finished arranging for a new transcriptionist, and soon I should be
22677 able to send off and get back my dictation of my (a) exposing streams to lisp,
22678 and (b) allowing for proper lisp-created coding systems, which define their
22679 reading, writing, and detecting methods in lisp.
22680
22681
22682 BTW How's it going wrt your Unicode and decode-priority stuff?
22683
22684 And ... you sent me mail asking what it was you had promised me, and listed
22685 only one thing, which was
22686 profiling of vm and certain other operations you found showed tremendous
22687 slowdown with Japanese characters. The other main thing I want from you is
22688
22689 -- Your priorities, as an actual Japanese user and XEmacs developer,
22690 concerning what MULE work should be done, how it should be done, in what
22691 order, etc.
22692
22693 I'm sure there's something else, but it's been awhile since I took my sleeping
22694 dose and my brain can barely function anymore. Just let me know how you're
22695 going to proceed with the above macro changes.
22696
22697 BTW there's some nice Perl scripts written by Martin and fixed by me to make
22698 global-search-and-replace
22699 much, much easier. I've attached them. The first one is a shell script that
22700 works like
22701
22702 gr foo bar *.[ch]
22703
22704 and replaces foo with bar in all of the files. For each modified file, a
22705 backup is created in the backup/ directory, which is created as necessary.
22706 This shell script is a fairly trivial front end onto global-replace2, which is
22707 a Perl script that takes one argument (a Perl expression such as s/foo/bar/g)
22708 and a list of files obtained by reading the stdin, and does the same global
22709 replacement. This means that the regexp syntax used here has to be Perl-style
22710 rather than standard Emacs/grep style.
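
As a rough illustration of the behavior described above (not the attached
scripts themselves, which are not reproduced here), the replace-with-backup
step could look something like the following Python sketch. The function name
global_replace and the choice to put backup/ next to each modified file are
assumptions made for the example, and Python regexp syntax differs from Perl's
in places.

```python
import re
import shutil
from pathlib import Path


def global_replace(pattern, replacement, paths, backup_dir="backup"):
    """Replace pattern in each file and back up every file that changes."""
    regex = re.compile(pattern)
    changed = []
    for path in map(Path, paths):
        old_text = path.read_text()
        new_text = regex.sub(replacement, old_text)
        if new_text == old_text:
            continue  # unmodified files get no backup
        backup_root = path.parent / backup_dir
        backup_root.mkdir(exist_ok=True)  # created only when needed
        shutil.copy2(path, backup_root / path.name)
        path.write_text(new_text)
        changed.append(path)
    return changed
```

Calling global_replace("foo", "bar", list_of_c_files) would then play the role
of the gr command line shown earlier.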
22711
22712 ben
22713
22714 ---------------------------------------------------------------------
22715
22716
22717 From:
22718 Ben Wing <ben@@666.com>
22719 12/23/1999 3:34 AM
22720
22721 Subject:
22722 Re: check process state before accessing coding_stream (fix PR#1061)
22723 To:
22724 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp>
22725 CC:
22726 XEmacs Developers <xemacs-beta@@xemacs.org>
22727
22728
22729
22730
22731 Thankfully, nearly all of this horridity you bring up is irrelevant. In
22732 XEmacs, "gettext" does not refer to any standard API, but is merely a stand-in
22733 for a translation routine (presumably written by us). We may as well call it
22734 something else. We define our own concept of "current language". We also
22735 allow for a function that needs a different version for each language, which
22736 handles all cases where simple translation isn't sufficient, e.g. when you
22737 have to pluralize some noun given to you or insert the correct form of the
22738 definite article. No weird hacks needed. No interaction problems with other
22739 pieces of software.
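
To make the idea above concrete, here is a minimal Python sketch of a
translation routine with its own notion of current language, plus a function
that has a separate version per language for a case that plain lookup cannot
handle (pluralizing a noun). Every name in it (STATE, CATALOG, translate,
count_files) is hypothetical and chosen only for illustration; it is not the
design that was written up.

```python
# Our own "current language", independent of any system locale API.
STATE = dict(language="en")

# Hypothetical catalog: (language, English string) -> translation.
CATALOG = dict([
    (("de", "Save file?"), "Datei speichern?"),
])


def translate(text):
    """Plain lookup, falling back to the English original."""
    return CATALOG.get((STATE["language"], text), text)


def count_files_en(n):
    return "1 file" if n == 1 else str(n) + " files"


def count_files_de(n):
    return "1 Datei" if n == 1 else str(n) + " Dateien"


# One implementation per language, selected by the current language.
COUNT_FILES = dict(en=count_files_en, de=count_files_de)


def count_files(n):
    impl = COUNT_FILES.get(STATE["language"], count_files_en)
    return impl(n)
```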
22740
22741 What I wrote "awhile ago" is (unfortunately) not anywhere public currently,
22742 but it's on my list to put it on the web site. "There you go again" is
22743 usually not true; most of what I quote was indeed put out publicly at some
22744 point, but I'll try to be more explicit about this in the future.
22745
22746 ben
22747
22748 "Stephen J. Turnbull" wrote:
22749
22750 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes:
22751 >
22752 > Ben> "Stephen J. Turnbull" wrote:
22753 >
22754 > >> What I have in mind is not just gettext-izing everything in the
22755 > >> XEmacs core sources. I currently believe that to be
22756 > >> unacceptable
22757 >
22758 > Ben> I don't quite understand. Could you elaborate and give some
22759 > Ben> examples?
22760 >
22761 > Examples? Hmm.
22762 >
22763 > First, there's the surface of Jan's y-or-n-p example. You have to
22764 > coordinate the translation of the message string and the response
22765 > prompt. This is handled by y-or-n-p itself (I see that we already do
22766 > have gettext for Emacs Lisp, that's nice to know).
22767 >
22768 > Except that it's not really handled by y-or-n-p. There's no reason to
22769 > suppose that somebody writing a Lisp package would necessarily use the
22770 > XEmacs domain (in fact, due to the way gettext binds text domains---if
22771 > I understand that correctly---we don't want that to be the case,
22772 > because it means that every time a Lisp package is updated the whole
22773 > XEmacs catalog must also be updated). So which domain gets used for
22774 > the message string?
22775 >
22776 > In the current implementation, it is the domain of y-or-n-p. So
22777 > packages with their own domain won't get y-or-n-p prompts correctly
22778 > translated. But that means that the package should do its own
22779 > translation. But now you're applying gettext to the same string
22780 > twice; you just have to pray that the translator upstream doesn't
22781 > collide with an English string that's in the XEmacs domain. (The
22782 > gettext docs mention the similar problem of English words with
22783 > multiple meanings that must map to different words in the target
22784 > language; this can be disambiguated by various trickeries in forming
22785 > the strings ... but only if you "own" them, which in the multi-domain,
22786 > iterated gettext example you do not.) AFAICT this means that you
22787 > must never pass untranslated strings across public APIs, but this may
22788 > or may not be reasonable, and certainly is inconvenient.
22789 >
22790 > Next, we have to translate the possible answer strings to match the
22791 > language being passed by the user. This is presumably OK here,
22792 > because it's done by y-or-n-p. But what if y-or-n-p returned a string
22793 > rather than a boolean? Then we would need to coordinate the
22794 > presentation of the prompt (done by y-or-n-p) and the translation of
22795 > the possible answer strings (done by the caller). This can in fact be
22796 > done using dgettext with the XEmacs domain, but you must know that
22797 > y-or-n-p is in the XEmacs domain. This is not necessarily going to be
22798 > obvious, and it might very well be that sets of related packages might
22799 > have the same domain, so you wouldn't necessarily know which domain is
22800 > appropriate by looking at the requires.
22801 >
22802 > And what happens if one domain does supply translations for a language
22803 > and the other does not? AFAIK, gettext has no way to find out if this
22804 > is the case. But you might very well prefer a global fallback to
22805 > English if substantial phrases are drawn from both domains, while you
22806 > might prefer string-by-string fallback if the main text is translated
22807 > and only a few words are left to fallback to English.
22808 >
22809 > Aside from confusing users, this puts a great burden on programmers.
22810 > Programmers need to know about the status of the domains of packages
22811 > they use as well as the XEmacs domain; they need to program
22812 > defensively against the possibility that some package they use will
22813 > become gettext-ized, or the translation projects will be out of synch
22814 > (some teams will do the calling package first, others will do the
22815 > caller package first).
22816 >
22817 > I don't think anybody will use gettext in these circumstances. At
22818 > least not after they get the first bug report that "XEmacs is stuck in
22819 > an infinite y-or-n-p loop and I can't get out."
22820 >
22821 > Ben> I wrote this awhile ago:
22822 >
22823 > "There you go again." Not anywhere I could see it! (At least, it
22824 > doesn't look familiar and grepping the archives doesn't turn it up.)
22825 >
22826 > OK, you win. Subscribe me to xemacs-review. Or whatever seems
22827 > appropriate.
22828 >
22829 > --
22830 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
22831 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
22832 > _________________ _________________ _________________ _________________
22833 > What are those straight lines for? "XEmacs rules."
22834
22835 --
22836 In order to save my hands, I am cutting back on my responses, especially
22837 to XEmacs-related mail. You _will_ get a response, but please be patient.
22838 If you need an immediate response and it is not apparent in your message,
22839 please say so. Thanks for your understanding.
22840
22841
22842
22843 --------------------------------------------------------------------
22844
22845
22846 From:
22847 Ben Wing <ben@@666.com>
22848 12/21/1999 2:22 AM
22849
22850 Subject:
22851 Re: check process state before accessing coding_stream (fix PR#1061)
22852 To:
22853 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp>
22854 CC:
22855 XEmacs Developers <xemacs-beta@@xemacs.org>
22856
22857
22858
22859
22860
22861 "Stephen J. Turnbull" wrote:
22862
22863 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes:
22864 >
22865 > Ben> Implementing message translation is not that hard.
22866 >
22867 > What I have in mind is not just gettext-izing everything in the XEmacs
22868 > core sources. I currently believe that to be unacceptable (see Jan's
22869 > message for the pitfalls in I18N; it's worse for M17N). I think
22870 > really solving this problem needs a specifier-like fallback mechanism
22871 > (this would solve Jan's example because you could query the
22872 > text-specifier presenting the question for the affirmative and
22873 > negative responses, and the catalog-building mechanism would have
22874 > checks to make sure they were properly set, perhaps a locale
22875 > (language) argument), and gettext is just not sufficient for that.
22876
22877 I don't quite understand. Could you elaborate and give some examples?
22878
22879 >
22880 >
22881 > At a minimum, we need to implement gettext for Lisp packages.
22882 > (Currently, gettext is only implemented for C AFAIK.) But this could
22883 > potentially cause more trouble than it's worth.
22884 >
22885 > Ben> A lot depends on priority: How important do you think this
22886 > Ben> issue is to your average Japanese/Chinese/etc. user?
22887 >
22888 > Which average Japanese (etc) user? The English-skilled (relatively)
22889 > programmer in the free software movement, or my not-at-all-competent
22890 > undergrad students who I would love to have using an Emacs? This is a
22891 > really important ease-of-use issue.
22892 >
22893 > Realistically, for Japanese, it's low priority. The Japanese team in
22894 > the GNU Translation Project is doing very little AFAIK, so even if the
22895 > capability were there, I doubt the message catalog would soon be done.
22896 >
22897 > But I think that many non-English speakers would find it very
22898 > attractive, and for many languages there are well-organized and
22899 > productive translation teams. I suspect that if the I18N facility
22900 > were well-designed, many Western European languages would have full
22901 > catalogs within a year (granted, they are the ones where it's least
22902 > needed :-( ).
22903 >
22904 > Personally, I think doing it well is hard, and of little benefit to
22905 > _current_ core XEmacs constituency. I think doing a good job, with
22906 > catalogs, would be very attractive to many non-English-speaking
22907 > _potential_ users.
22908 >
22909 > Ben> How does it compare to some of the other important Mule
22910 > Ben> issues that Martin and I are (trying to work) on?
22911 >
22912 > I don't know what you guys are _trying_ to work on. Everything in the
22913 > I18N section of "Architecting XEmacs" is red-flagged. OTOH, it's
22914 > clear from your posts that you are overburdened, so I can't read
22915 > priority into the fact that you've responded to specific issues in the
22916 > past.
22917
22918 I wrote this awhile ago:
22919
22920
22921 >
22922 > Ben> The big question is, would you be willing to help do the
22923 > Ben> actual implementation, to "be my hands"?
22924 >
22925 > Sure, subject to the usual caveat that I'd need to be convinced it's
22926 > worth doing and a secondary caveat that I am not an experienced coder.
22927
22928 If you'll implement it, I'll design it. It's more a case of will on your part
22929 than anything else. I can give you instructions sufficient enough to match
22930 your level of expertise.
22931
22932 ben
22933
22934 >
22935 >
22936 > --
22937 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
22938 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
22939 > _________________ _________________ _________________ _________________
22940 > What are those straight lines for? "XEmacs rules."
22941
22942 --
22943 In order to save my hands, I am cutting back on my responses, especially
22944 to XEmacs-related mail. You _will_ get a response, but please be patient.
22945 If you need an immediate response and it is not apparent in your message,
22946 please say so. Thanks for your understanding.
22947
22948
22949
22950 -----------------------------------------------------------------------------
22951
22952 Dec 20, 1999
22953
22954
22955 Implementing message translation is not that hard. I've already done a lot of
22956 preliminary work in places such as @file{make-msgfile.lex} in lib-src/. Finishing up
22957 the work is not that big a task; I already know exactly how it should be
22958 done. Perhaps I'll write up detailed design instructions for this, as I'm
22959 doing for other things. A lot depends on priority: How important do you think
22960 this issue is to your average Japanese/Chinese/etc. user? How does it compare
22961 to some of the other important Mule issues that Martin and I are (trying to
22962 work) on? If I did the design document, would you be willing to do the
22963 necessary bit of C hackery to implement the document? If the design document
22964 is not specific enough for you, I can give you an "implementation document"
22965 which will definitely be specific enough: i.e. I'll show you exactly where the
22966 code needs to be modified, and how. The big question is, would you be willing
22967 to help do the actual implementation, to "be my hands"?
22968
22969 ---------------------------------------------------------------------------
22970
22971 From:
22972 Ben Wing <ben@@666.com>
22973 12/14/1999 11:00 PM
22974
22975 Subject:
22976 Re: Mule UI disaster: displaying character tables
22977 To:
22978 Hrvoje Niksic <hniksic@@iskon.hr>
22979 CC:
22980 XEmacs vs Mule <xemacs-mule@@xemacs.org>
22981
22982
22983
22984
22985 What I mean is, please put my name in the header, as well as xemacs-mule.
22986 That way I'll see it in my personal box.
22987
22988 I agree that Mule has problems, but:
22989
22990 Brokenness can be fixed.
22991 Slowness can be fixed.
22992 Limitations can be fixed.
22993
22994 The design limitation you mention below, for example, is not really very
22995 hard to change.
22996
22997 Keep in mind that I pretty much rewrote Mule from scratch, and did it
22998 @strong{all} in 6-7 months. In comparison with that, the changes below are
22999 pretty minor, and each could be done by a good (and able-bodied!)
23000 programmer familiar with the Mule code in less than a week -- to the
23001 XEmacs code, at least. The problem is, everyone who could do this work is
23002 spending their time complaining about Mule problems instead of
23003 doing things.
23004
23005 I'll gladly help out anyone who wants to do Mule coding by explaining all
23006 the details; I'll even write a "Mule internals manual", if that will
23007 help. I can also make international phone calls -- they're cheap here in
23008 the US due to the long distance wars. But so far no one has asked me for
23009 help or shown any willingness to do any work on Mule.
23010
23011 Perhaps people are daunted by the seeming vastness of the problems. But I
23012 wager that if I had another 6 months to work on nothing but Mule, it would
23013 be nearly perfect. The basic design of the XEmacs C code is good;
23014 incremental changes, without over-much concern for compatibility, could
23015 make huge strides in a short amount of time (as was the case the whole
23016 time I worked on it, esp. towards the end -- it didn't even @strong{compile} for
23017 4 months!). A "total rewrite" would be an incredible waste of time.
23018
23019 Again, I'm completely willing to provide help, documentation, design
23020 improvement suggestions (ala Architecting XEmacs -- which seems to have
23021 been completely ignored, alas), etc.
23022
23023 ben
23024
23025 Hrvoje Niksic wrote:
23026
23027 > Ben Wing <ben@@666.com> writes:
23028 >
23029 > > I'm the one who did most of the Mule work in XEmacs, so if you have
23030 > > any questions about the core, please address them to me directly. I
23031 > > can probably give you a very clear and detailed answer.
23032 >
23033 > Thanks. I think it still makes sense to ask here, so that other
23034 > developers have a chance to chime in.
23035 >
23036 > > However, I need some explanation. What's misdesigned that you're
23037 > > complaining about? And what's the coding-system disaster?
23038 >
23039 > It's been spoken of a lot. Basically:
23040 >
23041 > * Unlike XEmacs/no-Mule, XEmacs/Mule doesn't preserve binary files in
23042 > Latin 2 locales by default. This is annoying for users who are used
23043 > to XEmacs/no-Mule.
23044 >
23045 > * XEmacs/Mule is much slower than XEmacs, and not only because of
23046 > character/byte conversions. It seems that font lookups etc. are
23047 > slower.
23048 >
23049 > * The "coding-system disaster" refers to inherent limitations of the
23050 > coding-system model. If I understand things correctly,
23051 > coding-systems convert streams of bytes to streams of Emchars. It
23052 > does not appear to be possible to create a "gzip" coding system for
23053 > handling gzipped files. Even EOL conversions look kludgish:
23054 >
23055 > iso-2022-8
23056 > iso-2022-8-dos
23057 > iso-2022-8-mac
23058 > iso-2022-8-unix
23059 > iso-2022-8bit-ss2
23060 > iso-2022-8bit-ss2-dos
23061 > iso-2022-8bit-ss2-mac
23062 > iso-2022-8bit-ss2-unix
23063 > iso-2022-int-1
23064 > iso-2022-int-1-dos
23065 > iso-2022-int-1-mac
23066 > iso-2022-int-1-unix
23067 >
23068 > Ideally, it should be possible to specify a stream of
23069 > coding-systems, where only the last one converts to actual Emchars.
23070 >
23071 > There are more problems I don't remember right now. Many many usage
23072 > problems become apparent when I stand and look over the shoulders of
23073 > an XEmacs user who tries to use Mule.
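
The "stream of coding-systems" idea quoted above, where only the last stage
produces characters, can be sketched in Python as a list of stages applied in
order. This is only an illustration under stated assumptions: the stage names
are made up, the stages work on whole buffers rather than incrementally as a
real stream would, and nothing here reflects the actual XEmacs coding-system
API.

```python
import gzip


def gunzip(data):
    return gzip.decompress(data)         # bytes -> bytes


def dos_eol(data):
    return data.replace(b"\r\n", b"\n")  # bytes -> bytes


def latin2(data):
    return data.decode("iso8859_2")      # bytes -> characters


def decode_chain(data, stages):
    """Run data through each stage in order; only the last yields text."""
    for stage in stages:
        data = stage(data)
    return data
```

With this arrangement, EOL handling is one reusable stage rather than a
separate -dos, -mac and -unix variant of every coding system, and a gzip stage
can be placed in front of any of them.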
23074
23075 --
23076 In order to save my hands, I am cutting back on my responses, especially
23077 to XEmacs-related mail. You _will_ get a response, but please be patient.
23078
23079 If you need an immediate response and it is not apparent in your message,
23080 please say so. Thanks for your understanding.
23081
23082
23083
23084 -----------------------------------------------------------------------
23085
23086
23087
23088
23089 From:
23090 Ben Wing <ben@@666.com>
23091 12/14/1999 12:20 AM
23092
23093 Subject:
23094 Re: Mule UI disaster: displaying character tables
23095 To:
23096 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp>
23097 CC:
23098 XEmacs vs Mule <xemacs-mule@@xemacs.org>
23099
23100
23101
23102
23103 I think you should go ahead with your proposal, and assume it will get
23104 implemented. I don't think Martin is really suggesting that API changes not
23105 be allowed, but just that they proceed in a somewhat orderly fashion; and in
23106 any case, I imagine I have final say in cases of Mule-related conflicts.
23107
23108 ben
23109
23110 "Stephen J. Turnbull" wrote:
23111
23112 > >>>>> "Hrvoje" == Hrvoje Niksic <hniksic@@iskon.hr> writes:
23113 >
23114 > Hrvoje> So next I tried the "Mule" menu. That's right, boys and
23115 > Hrvoje> girls, I've never looked at it before.
23116 >
23117 > For quite a while, it didn't work at all, led to crashes and other
23118 > warm/fuzzy things. IIRC there used to be a top level menu item
23119 > pointing to information about the current language environment but it
23120 > got removed.
23121 >
23122 > Hrvoje> Wow. Seeing shift_jis, iso-2022 variants and (above all
23123 > Hrvoje> things) big5 makes me really warm and fuzzy.
23124 >
23125 > We've been through this recently---you were there. We know what to do
23126 > about it, basically (Ben liked my proposal, and it would fix this
23127 > silliness as well as the binary file breakage). But given that Ben
23128 > and Martin seem to have different ideas about where to go with Mule
23129 > (Ben seemed to be supporting API and implementation revisions, Martin
23130 > evidently wants to keep the current Mule), working on that proposal is
23131 > possibly a waste of time. I've got other stuff on my plate and I'll
23132 > get back to it one of these days (not tomorrow but sooner than Real
23133 > Soon Now).
23134 >
23135 > Hrvoje> The items it presents (leading to further submenus) are:
23136 >
23137 > Hrvoje> 94 character set
23138 > Hrvoje> 94 x 94 character set
23139 > Hrvoje> 96 character set
23140 >
23141 > This _is_ bad UI, now that you point it out. But it is quite natural
23142 > for a coding system lawyer (as all Japanese users have to be), I never
23143 > noticed it before. Easy enough to fix ("raise my karma").
23144 >
23145 > Hrvoje> But I do bear some Mule scars, so I happily select "96
23146 > Hrvoje> character sets", then ISO8859-2. And I get this:
23147 >
23148 > [Table omitted]
23149 >
23150 > Hrvoje> So me wonders: what the hell is this?
23151 >
23152 > Huh? That is the standard table that you see over and over again in
23153 > references. I'll believe you if you say you've never seen one before,
23154 > but every Japanese users' manual has dozens of pages of those, using
23155 > exactly that format.
23156 >
23157 > The presentation in the range 00--7F is not unreasonable for Latin 2;
23158 > ISO-8859 is a version of ISO-2022, so the high bit should not be
23159 > interpreted as "+ x80" (technically speaking), it should be
23160 > interpreted as a character set shift.
23161 >
23162 > Of course, this doesn't make sense to anybody but a character set
23163 > lawyer, and so should be changed. Especially since the header refers
23164 > to ISO-8859-2 which everybody these days thinks of as _one, 8-bit_
23165 > character set, not two 7-bit ones.
23166 >
23167 > As for the "Japanese" in the table, that's just a really stupid
23168 > "optimization": those happen to be line-drawing characters available
23169 > in JIS X 0208, to make pretty borders. Substitute "-", "+", and "|"
23170 > in appropriate places to make ugly but portable borders.
23171 >
23172 > Hrvoje> Mule is just broken. Warn your friends.
23173 >
23174 > Hrvoje is on the rampage again. Warn your friends ;-)
23175 >
23176 > --
23177 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
23178 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
23179 > _________________ _________________ _________________ _________________
23180 > What are those straight lines for? "XEmacs rules."
23181
23182 --
23183 In order to save my hands, I am cutting back on my responses, especially
23184 to XEmacs-related mail. You _will_ get a response, but please be patient.
23185 If you need an immediate response and it is not apparent in your message,
23186 please say so. Thanks for your understanding.
23187
23188
23189
23190 ---------------------------------------------------------------------------
23191
23192 From:
23193 Ben Wing <ben@@666.com>
23194 12/14/1999 10:28 PM
23195
23196 Subject:
23197 Re: Autodetect proposal; specifier questions/suggestions
23198 To:
23199 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp>
23200
23201
23202
23203
23204 I've always thought the specifier API is too complicated (and too
23205 "write-only"), but I went back at one point well after I designed it and I
23206 couldn't figure out an obvious way to simplify it that still kept reasonable
23207 functionality. Perhaps that's what Custom did, and why it turned out bad.
23208
23209 Inefficiency is a stupid reason not to use them. They seem efficient enough
23210 for redisplay. Changing them might be inefficient, but Emacs Lisp is in
23211 general, right?
23212
23213 Can you propose an API or functionality change that will make them more used?
23214
23215
23216
23217 "Stephen J. Turnbull" wrote:
23218
23219 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes:
23220 >
23221 > Ben> I think you should go ahead with your proposal, and assume it
23222 > Ben> will get implemented.
23223 >
23224 > OK. "yas baas" ;-)
23225 >
23226 > On something totally different. I'm really bothered by the fact that
23227 > specifiers are so little used (eg, Custom reimplements them badly),
23228 > and the fact that every package seems to define its own set of faces
23229 > (or whatever), rather than use the specifier mechanism to inherit from
23230 > existing ones, or add new specifications to existing ones. API problem?
23231 >
23232 > Also, faces (maybe specifiers in general?) should have an autoload
23233 > mechanism, and a @file{<package>-faces.el} (or @file{<package>-specifiers.el})
23234 > convention. There are a number of faces in (eg) Custom that I like to
23235 > use, but I have to load Custom to get them. And Custom should be able
23236 > to somehow see all the faces in various packages available, even when
23237 > they are not loaded.
23238 >
23239 > I've seen claims that specifiers aren't very efficient.
23240 >
23241 > Opinions?
23242 >
23243 > --
23244 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
23245 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
23246 > _________________ _________________ _________________ _________________
23247 > What are those straight lines for? "XEmacs rules."
23248
23249 --
23250 In order to save my hands, I am cutting back on my responses, especially
23251 to XEmacs-related mail. You _will_ get a response, but please be patient.
23252 If you need an immediate response and it is not apparent in your message,
23253 please say so. Thanks for your understanding.
23254
23255
23256 -----------------------------------------------------------------------------
23257 From:
23258 Ben Wing <ben@@666.com>
23259 11/18/1999 9:02 PM
23260
23261 Subject:
23262 Re: Char-related crashes (hopefully) fixed
23263 To:
23264 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp>
23265 CC:
23266 XEmacs Beta List <xemacs-beta@@xemacs.org>
23267
23268
23269
23270
23271 OK, in summation:
23272
23273 1. C-q is a user-level function and should do whatever makes the most sense.
23274 2. int-char is a low-level primitive and should never depend on high-level
23275 settings like language environment.
23276 3. Everything you can do with int-char can and should be done with make-char
23277 -- representation-independent, much less likelihood of bugs, etc. Therefore
23278 int-char should be removed.
23279 4. Note that CLTL2 also removes int-char.
23280 5. Your statement
23281
23282 > In one-byte buffers (either Olivier's 1/2/4 extension or `xemacs -font
23283 > *-iso8859-2') it implicitly will have dependence whatever you say.
23284
23285 is confusing internal and external representations.
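
A small Python sketch may help show why a bare integer is ambiguous while a
(character set, code) pair is not. The helper char_in_charset is a hypothetical
stand-in for the make-char style of construction; it takes a full 8-bit byte
value and uses Python codec names, so its arguments are assumptions of the
example and not the signature of the XEmacs primitive.

```python
def char_in_charset(charset, code):
    """Build a character from an explicit (charset, code) pair."""
    return bytes([code]).decode(charset)


# The same integer names different characters in different charsets.
latin1_char = char_in_charset("iso8859_1", 0xE8)  # e with grave accent
latin2_char = char_in_charset("iso8859_2", 0xE8)  # c with caron
```

Because the character set is always stated, the result does not depend on the
language environment or on the internal representation.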
23286
23287 ben
23288
23289 "Stephen J. Turnbull" wrote:
23290
23291 > Can somebody give a bunch of examples where using integers as
23292 > characters is useful? For that matter, where they are actually used?
23293 > Ben said "backward compatibility," but I haven't seen this used, and I
23294 > don't really know how to grep for it. I have grepped for int-char,
23295 > int-to-char, char-int, and char-to-int and they're pretty rare in the
23296 > core and package code (2/3 of it) that I have.
23297 >
23298 > The only one that I ever use is the C-q hack for inserting characters
23299 > by code value at the keyboard, and that could arguably be (and in
23300 > Japanese invariably is) delegated to an input method which would know
23301 > about language environment (and return a true character).
23302 >
23303 > For iterating over a character set in "natural" order, only ASCII
23304 > satisfies the requirement of having one, and even that's shaky. AFAIK
23305 > the Swedes and the Norwegians, or is it the Danes, disagree on
23306 > ordering the _letters_ in ISO-8859-1 character set. This really
23307 > should be table-driven, and will have to be for everything except
23308 > ASCII and ISO-8859-1 if we go to a Unicode internal representation.
23309 >
23310 > We already have primitives for efficient case conversion and the like.
23311 >
23312 > The only example I can think of offhand where you would really really
23313 > want the facility is to iterate over a code space where you don't know
23314 > which points are legal characters. Eg, to print out tables of fonts.
23315 > Pretty specialized. And this can be done through make-char, anyway.
23316 >
23317 > According to CLtL1, the main portable use for char-int is for hashing.
23318 > But that doesn't square with the kind of usage we've been talking
23319 > about (in loops and the like).
23320 >
23321 > What else am I missing?
23322 >
23323 > Ben's desiderata have some problems.
23324 >
23325 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes:
23326 >
23327 > Ben> Either int-char should be the mirror opposite of char-int
23328 > Ben> (i.e. accept all legal char integers), or it should be
23329 > Ben> removed entirely.
23330 >
23331 > OK. I agree with this.
23332 >
23333 > Ben> int-char should @strong{never} have any dependence on the language
23334 > Ben> environment.
23335 >
23336 > In one-byte buffers (either Olivier's 1/2/4 extension or `xemacs -font
23337 > *-iso8859-2') it implicitly will have dependence whatever you say.
23338 > Even without Mule, people can always use external encoders to change
23339 > raw ISO-8859-2 to ISO-2022 (not that anybody sane ever would, OK,
23340 > Hrvoje?). Then the two files will be interpreted differently in a
23341 > Latin-1 locale Mule; the ISO-8859-2 file will be recognized as
23342 > ISO-8859-1, and the ISO-2022 file will be internally interpreted as
23343 > ISO-8859-2.
23344 >
23345 > The point is that people normally assume that int-char should accept
23346 > their "natural" integer to character map. For Americans, that's
23347 > ASCII, for Germans, that's ISO-8859-1, for Croatians, that's
23348 > ISO-8859-2. And it works "correctly" in a no-mule XEmacs with `-font
23349 > *-iso8859-2'! Japanese usually use ku-ten or JIS, and there's a
23350 > "natural" map from byte-sized integer pairs to shorts, but it's full
23351 > of holes. So language environments don't agree on what a legal char
23352 > integer is, and where they do (eg, ISO-8859-1 and ISO-8859-2), they
23353 > don't agree on the map. To satisfy your dictum (with which I agree,
23354 > but I take to mean we should get rid of these functions) we can take
23355 > the intersection where they agree
23356 >
23357 > ==> legal char integers == ASCII
23358 >
23359 > which is what I prefer, or pick something arbitrary and efficient
23360 >
23361 > ==> char-int returns the internal representation
23362 >
23363 > which I really hate, or something else. Suggestions?
23364 >
23365 > Ben> I don't think C-q should either. If Hrvoje wants to insert
23366 > Ben> Latin-2 characters by number, then make C-u C-q work so that
23367 > Ben> it also prompts for a character set, with a default chosen
23368 > Ben> from the language environment.
23369 >
23370 > And restrict this to ASCII? Or assume Latin-1 in GR if there is no
23371 > prefix argument?
23372 >
23373 > This is a useful feature. C-q currently inserts Latin-2 characters
23374 > for Hrvoje in no-mule XEmacs (stretching the point only a little); I
23375 > think it should continue to do so in Mule. This really is an input
23376 > method issue, not a keyboard issue. In XEmacs, inserting an integer
23377 > into a buffer has no meaning. Users insert characters. So this is a
23378 > completely different issue from the programming API, and should not be
23379 > considered analogous.
23380 >
23381 > Maybe we could have C-q insert according to the Unicode standard, and
23382 > treat C-u C-q as part of the input method. But I think most users
23383 > would prefer to have C-q insert according to their locale-standard
23384 > tables, and select Unicode explicitly using the C-u C-q idiom. In
23385 > fact (again this points to the input method idea), Japanese users
23386 > would probably like to have the alternatives of using kuten (pairs
23387 > from 1--94 x 1--94) or JIS (pairs from 0x21--0x7E x 0x21--0x7E) as
23388 > options since both indexing systems are common in tables.
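
The two indexing systems mentioned in the quoted paragraph differ only by a
fixed offset: a kuten pair runs over 1 to 94 in each position, and the
corresponding JIS byte pair runs over 0x21 to 0x7E. A minimal Python sketch of
the conversion follows; the function names are hypothetical.

```python
def kuten_to_jis(ku, ten):
    """Convert a kuten pair (each 1..94) to a JIS byte pair (0x21..0x7E)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return ku + 0x20, ten + 0x20


def jis_to_kuten(hi, lo):
    """Convert a JIS byte pair (each 0x21..0x7E) back to a kuten pair."""
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("JIS bytes must each be in 0x21..0x7E")
    return hi - 0x20, lo - 0x20
```

An input method offering both options would only need to apply this offset
before looking the character up in the chosen character set.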
23389 >
23390 > --
23391 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
23392 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
23393 > __________________________________________________________________________
23394 > __________________________________________________________________________
23395 > What are those two straight lines for? "Free software rules."
23396
23397 --
23398 ben
23399
23400 --
23401 In order to save my hands, I am cutting back on my responses, especially
23402 to XEmacs-related mail. You _will_ get a response, but please be patient.
23403 If you need an immediate response and it is not apparent in your message,
23404 please say so. Thanks for your understanding.
23406
23407
23408
23409 -----------------------------------------------------------------------------
23410
23411 From:
23412 Ben Wing <ben@@666.com>
23413 11/16/1999 11:03 PM
23414
23415 Subject:
23416 Re: Char-related crashes (hopefully) fixed
23417 To:
23418 Yoshiki Hayashi <t90553@@m.ecc.u-tokyo.ac.jp>
23419 CC:
23420 Hrvoje Niksic <hniksic@@iskon.hr>,
23421 XEmacs Beta List <xemacs-beta@@xemacs.org>
23422
23423
23424
23425
23426 Either int-char should be the mirror opposite of char-int (i.e. accept all
23427 legal char integers), or it should be removed entirely.
23428
23429 int-char should @strong{never} have any dependence on the language environment.
23430
23431 I don't think C-q should either. If Hrvoje wants to insert Latin-2
23432 characters by number, then make C-u C-q work so that it also prompts for a
23433 character set, with a default chosen from the language environment.
23434
23435 ben
23436
23437 Yoshiki Hayashi wrote:
23438
23439 > Hrvoje Niksic <hniksic@@iskon.hr> writes:
23440 >
23441 > > As Ben said, now that we've fixed the actual bugs, we can think about
23442 > > changing the behaviour for int-char conversions for 21.2.
23443 >
23444 > The following have been proposed for which integers should be
23445 > accepted where characters are expected:
23446 >
23447 > 1) Don't allow anything
23448 > 2) Accept 0-127
23449 > 3) Accept 0-256
23450 > 4) Accept everything
23451 >
23452 > Other things proposed are:
23453 >
23454 > a) When doing C-q, treat 128-256 as Latin-2 in Latin 2
23455 > language environment.
23456 >
23457 > So far, most of the proposal is intended to apply to every
23458 > int-char conversions, I'd like to make some functions to
23459 > accept.
23460 >
23461 > My plan is:
23462 > Accept only 0-256 in every place except int-to-char.
23463 > int-to-char accepts every valid integer.
23464 > Make new function which does int-to-char conversion
23465 > correctly according to the language environment.
23466 >
23467 > This way, most of the code which does (insert (1+ ?a)) or
23468 > something continues working. Now internal representation is
23469 > changed a little bit, so disabling > 256 characters will
23470 > warn those who are dealing with internal representation
23471 > directly, which is bad. Still, you can do
23472 > (let ((i 1442))
23473 >   (while (< i 2000)
23474 >     (insert (int-to-char i))
23475 >     (setq i (1+ i))))
23476 > to achieve old behaviour.
23477 >
23478 > For C-q, I'm not for changing its original definition,
23479 > since it might confuse people who are expecting Latin-1 in
23480 > other language environment and typing just 1 integer doesn't
23481 > make sense for multibyte world. It's cleaner to make new
23482 > function, which does make-char according to the charset of
23483 > language-info-alist so that people who use that often can
23484 > bind it to C-q or some other keys.
23485 >
23486 > --
23487 > Yoshiki Hayashi
23488
23489 --
23490 ben
23491
23492 --
23493 In order to save my hands, I am cutting back on my responses, especially to
23494 XEmacs-related mail. You
23495 _will_ get a response, but please be patient. If you need an immediate
23496 response and it’s not apparent in
23497 your message, please say so. Thanks for your understanding.
23498
23499
23500
23501 @end example
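
The kuten and JIS indexing schemes mentioned in the discussion above
differ only by a fixed offset: a kuten pair runs over 1--94 in each
position, the corresponding JIS pair runs over 0x21--0x7E, and so each
JIS value is the kuten value plus 0x20.  The following is a minimal
sketch of that arithmetic in Emacs Lisp.  The function names
@code{kuten-to-jis} and @code{jis-to-kuten} are hypothetical helpers
invented for illustration; they are not part of XEmacs, and the sketch
does not attempt to show how an input method would actually insert the
resulting character.

@example
;; Hypothetical helpers, for illustration only.
(defun kuten-to-jis (ku ten)
  "Convert a kuten pair (each 1..94) to a JIS pair (each #x21..#x7E)."
  (if (and (>= ku 1) (<= ku 94)
           (>= ten 1) (<= ten 94))
      ;; Both positions are shifted by the same offset, #x20.
      (list (+ ku #x20) (+ ten #x20))
    (error "Kuten values must each be in the range 1..94")))

(defun jis-to-kuten (byte1 byte2)
  "Convert a JIS pair (each #x21..#x7E) to a kuten pair (each 1..94)."
  (if (and (>= byte1 #x21) (<= byte1 #x7E)
           (>= byte2 #x21) (<= byte2 #x7E))
      ;; The inverse mapping subtracts the same offset.
      (list (- byte1 #x20) (- byte2 #x20))
    (error "JIS values must each be in the range #x21..#x7E")))
@end example

For instance, @code{(kuten-to-jis 16 1)} returns the list @code{(48 33)},
that is, 0x30 0x21, and @code{(jis-to-kuten 48 33)} maps it back to
@code{(16 1)}.  Because the two schemes are this closely related, a
prompt that accepts either form would only need to know which one the
user is typing.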
23502
23503 @node Discussion -- Windows External Widget, Discussion -- Packages, Discussion -- Multilingual Issues, Future Work Discussion
23504 @section Discussion -- Windows External Widget
23505 @cindex discussion, windows external widget
23506 @cindex windows external widget, discussion
23507
23508 @example
23509
23510 Subject:
23511 Re: External Widget Support for Xemacs on nt
23512 Date:
23513 Sat, 08 Jul 2000 01:47:14 -0700
23514 From:
23515 Ben Wing <ben@@666.com>
23516 To:
23517 Timothy.Fowler@@msdw.com
23518 CC:
23519 xemacs-nt@@xemacs.org
23520 References:
23521 1
23522
23523
23524
23525
23526 Nothing is currently done for external widget support under XEmacs but it should
23527 not be too hard to do and would be a great addition to XEmacs. What you would
23528 probably want to do is create an XEmacs control that has an interface something
23529 like the built-in edit control and which communicates to an existing XEmacs
23530 process using DDE. (Basically you would modify XEmacs so that it registered
23531 itself as a DDE server accepting external widget requests, and then the external
23532 edit control would simply send a DDE request and the result would be a handle of
23533 some sort used for future communication with that particular XEmacs process.)
23534
23535 There are two basic issues in getting the external widget to work, which are
23536 display and input. Although I am not completely sure, I have a feeling that it
23537 is possible for one process to write into the window of another process, simply
23538 by using that window's HWND handle. If so it should be extremely easy to get the
23539 output working (this is exactly the approach used under Xt). For input, you
23540 would probably again want to do what is done under Xt, which is that the client
23541 widget simply passes all of the appropriate messages to the XEmacs server
23542 process using whatever communication channel was set up, e.g. DDE, and the
23543 XEmacs server processes them normally. Very few modifications would be needed to
23544 the XEmacs source code and all of the necessary modifications could be done
23545 simply by looking for existing external widget code in XEmacs.
23546
23547 If you are interested in continuing this, I will certainly give you any support
23548 you need along the way. This would be a great project to be added to XEmacs.
23549
23550
23551
23552 Timothy Fowler wrote:
23553
23554 > I am looking into external widget support for xemacs nt similar to that
23555 > existing in xemacs for X
23556 > Have any development efforts been made in this direction in the past?
23557 > Is there any current effort?
23558 > Any insight into the complexity of achieving this?
23559 > Any comments would be greatly appreciated
23560 > Thanks
23561 > Tim Fowler
23562
23563 --
23564 Ben
23565
23566 In order to save my hands, I am cutting back on my mail. I also write
23567 as succinctly as possible -- please don't be offended. If you send me
23568 mail, you _will_ get a response, but please be patient, especially for
23569 XEmacs-related mail. If you need an immediate response and it is not
23570 apparent in your message, please say so. Thanks for your understanding.
23571
23572 See also http://www.666.com/ben/chronic-pain/
23573
23574
23575 Subject:
23576 RE: External Widget Support for Xemacs on nt
23577 Date:
23578 Mon, 10 Jul 2000 12:40:01 +0100
23579 From:
23580 "Alastair J. Houghton" <ajhoughton@@lineone.net>
23581 To:
23582 "Ben Wing" <ben@@666.com>, <xemacs-nt@@xemacs.org>
23583 CC:
23584 <Timothy.Fowler@@msdw.com>
23585
23586
23587
23588
23589 > -----Original Message-----
23590 > From: owner-xemacs-nt@@xemacs.org [mailto:owner-xemacs-nt@@xemacs.org]On
23591 > Behalf Of Ben Wing
23592 > Sent: 08 July 2000 09:47
23593 > To: Timothy.Fowler@@msdw.com
23594 > Cc: xemacs-nt@@xemacs.org
23595 > Subject: Re: External Widget Support for Xemacs on nt
23596 >
23597 > Nothing is currently done for external widget support under
23598 > XEmacs but it should
23599 > not be too hard to do and would be a great addition to XEmacs.
23600 > What you would
23601 > probably want to do is create an XEmacs control that has an
23602 > interface something
23603 > like the built-in edit control and which communicates to an
23604 > existing XEmacs
23605 > process using DDE.
23606
23607 It would be @strong{much} better to use RPC or COM rather than DDE - and
23608 also it would provide a more useful interface to XEmacs (like the
23609 Microsoft rich text edit control that is used by Wordpad). It
23610 would probably also be easier...
23611
23612 > If you are interested in continuing this, I will certainly give
23613 > you any support
23614 > you need along the way. This would be a great project to be added
23615 > to XEmacs.
23616
23617 I agree. This would be a *really useful* thing to do...
23618
23619 Regards,
23620
23621 Alastair.
23622
23623 ____________________________________________________________
23624 Alastair Houghton ajhoughton@@lineone.net
23625
23626 Subject:
23627 Re: External Widget Support for Xemacs on nt
23628 Date:
23629 Mon, 10 Jul 2000 22:56:06 -0700
23630 From:
23631 Ben Wing <ben@@666.com>
23632 To:
23633 "Alastair J. Houghton" <ajhoughton@@lineone.net>
23634 CC:
23635 xemacs-nt@@xemacs.org, Timothy.Fowler@@msdw.com
23636 References:
23637 1
23638
23639
23640
23641
23642 sounds good. i don't know too much about windows ipc methods, so i suggested
23643 dde just as an example.
23644
23645 "Alastair J. Houghton" wrote:
23646
23647 > > -----Original Message-----
23648 > > From: owner-xemacs-nt@@xemacs.org [mailto:owner-xemacs-nt@@xemacs.org]On
23649 > > Behalf Of Ben Wing
23650 > > Sent: 08 July 2000 09:47
23651 > > To: Timothy.Fowler@@msdw.com
23652 > > Cc: xemacs-nt@@xemacs.org
23653 > > Subject: Re: External Widget Support for Xemacs on nt
23654 > >
23655 > > Nothing is currently done for external widget support under
23656 > > XEmacs but it should
23657 > > not be too hard to do and would be a great addition to XEmacs.
23658 > > What you would
23659 > > probably want to do is create an XEmacs control that has an
23660 > > interface something
23661 > > like the built-in edit control and which communicates to an
23662 > > existing XEmacs
23663 > > process using DDE.
23664 >
23665 > It would be @strong{much} better to use RPC or COM rather than DDE - and
23666 > also it would provide a more useful interface to XEmacs (like the
23667 > Microsoft rich text edit control that is used by Wordpad). It
23668 > would probably also be easier...
23669 >
23670 > > If you are interested in continuing this, I will certainly give
23671 > > you any support
23672 > > you need along the way. This would be a great project to be added
23673 > > to XEmacs.
23674 >
23675 > I agree. This would be a *really useful* thing to do...
23676 >
23677 > Regards,
23678 >
23679 > Alastair.
23680 >
23681 > ____________________________________________________________
23682 > Alastair Houghton ajhoughton@@lineone.net
23683
23684 --
23685 Ben
23686
23687 In order to save my hands, I am cutting back on my mail. I also write
23688 as succinctly as possible -- please don't be offended. If you send me
23689 mail, you _will_ get a response, but please be patient, especially for
23690 XEmacs-related mail. If you need an immediate response and it is not
23691 apparent in your message, please say so. Thanks for your understanding.
23692
23693 See also http://www.666.com/ben/chronic-pain/
23694
23695 @end example
23696
23697
23698 @node Discussion -- Packages, Discussion -- Distribution Layout, Discussion -- Windows External Widget, Future Work Discussion
23699 @section Discussion -- Packages
23700 @cindex discussion, packages
23701 @cindex packages, discussion
23702
23703 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
23704
23705 @subheading Important package-related changes
23706
23707 This file details changes that make the package system no longer an
23708 unmitigated disaster. This way, at the very least, people can
23709 essentially ignore the package system and not get bitten horribly the
23710 way they currently do.
23711
23712 @enumerate
23713 @item
23714 A single tarball containing absolutely everything and named
23715 xemacs-21.2.68.tar.gz. This must contain absolutely everything,
23716 including all of the packages, and in the proper directory
23717 structure, so that the paradigm
23718
23719 untar; configure; make; make install
23720
23721 just works.
23722
23723 @item
23724 Fix the startup slowdown when all packages are installed so that
23725 there is absolutely no penalty to having them all installed. This
23726 may be hard.
23727
23728 @item
23729 All files on the ftp site should be accessible through http.
23730
23731 @item
23732 Put symlinks into the distribution directory to the appropriate
23733 files in the package directory.
23734
23735 @item
23736 Eliminate the confusing SUMO name, choosing a much more obvious
23737 name such as all-packages.
23738
23739 @item
23740 There should be no separation of mule and non-mule packages.
23741
23742 @item
23743 Having 2 packages that conflict with each other should be
23744 completely disallowed.
23745
23746 @item
23747 Fix vc and ps-print so that there is only ONE version.
23748
23749 @item
23750 Fix up all of the READMEs on the distribution site to make it
23751 abundantly clear what needs to be obtained, where to get it, and
23752 how to install it, especially with regard to packages.
23753 @end enumerate
23754
23755 @node Discussion -- Distribution Layout, , Discussion -- Packages, Future Work Discussion
23756 @section Discussion -- Distribution Layout
23757 @cindex discussion, distribution layout
23758 @cindex distribution layout, discussion
23759
23760
23761 @example
23762 From:
23763 Ben Wing <ben@@666.com>
23764 10/15/1999 8:50 PM
23765
23766 Subject:
23767 VOTE: Absolutely necessary changes to file naming in releases
23768 To:
23769 SL Baur <steve@@xemacs.org>,
23770 XEmacs Reviews <xemacs-review@@xemacs.org>
23771
23772
23773
23774
23775 Everybody except Steve seems to agree that we need to provide a single
23776 tar file containing the entire XEmacs tree whenever we release a new
23777 version of XEmacs (beta or not). Therefore I propose the following
23778 simple changes, and ask for a vote. If it is the general will of the
23779 developers, then Steve @strong{WILL} make these changes. This is the
23780 definition of cooperative development -- no one, not even the
23781 maintainer, can assert absolute power over anything.
23782
23783 I propose (assuming, for example, release 21.2.20):
23784
23785 1. xemacs-21.2.20.tar.gz -> xemacs-21.2.20-core.tar.gz
23786
23787 2. xemacs-sumo.tar.gz -> xemacs-packages.tar.gz
23788
23789 3. xemacs-mule-sumo.tar.gz -> xemacs-mule-packages.tar.gz
23790
23791 4. Symlinks to the files mentioned in #2 and #3 get created in the SAME
23792 directory as xemacs-21.2.20-*.tar.gz.
23793
23794 5. MOST IMPORTANTLY, a new file xemacs-21.2.20.tar.gz gets created,
23795 which is the combination of the 5 files xemacs-21.2.20-core.tar.gz,
23796 xemacs-21.2.20-elc.tar.gz, xemacs-21.2.20-info.tar.gz,
23797 xemacs-packages.tar.gz, and xemacs-mule-packages.tar.gz.
23798
23799
23800 The directory structure of the new combined file xemacs-21.2.20.tar.gz
23801 would look like this:
23802
23803 xemacs-21.2.20/
23804 xemacs-packages/
23805 xemacs-mule-packages/
23806
23807
23808 I am sorry to shout, but the current situation is just completely
23809 insane.
23810
23811 ben
23812
23813
23814
23815
23816
23817
23818 From:
23819 Ben Wing <ben@@666.com>
23820 10/16/1999 3:12 AM
23821
23822 Subject:
23823 Re: VOTE: Absolutely necessary changes to file naming in releases
23824 To:
23825 SL Baur <steve@@xemacs.org>,
23826 XEmacs Reviews <xemacs-review@@xemacs.org>,
23827 "Michael Sperber [Mr. Preprocessor]" <sperber@@informatik.uni-tuebingen.de>
23828
23829
23830
23831
23832 Something went wrong with my mail program while I was responding, so
23833 Michael's response is not quoted here.
23834
23835 Let me rephrase my proposal, stressing the important points in order of
23836 importance:
23837
23838 1. MOST IMPORTANT: There MUST be a SINGLE tar file containing the complete
23839 XEmacs sources, packages, etc. The name of this tar file must have a
23840 format like this:
23841
23842 xemacs-21.2.10.tar.gz
23843
23844 The directory layout of the packages within it is not important as long as
23845 it works: The user who downloads the tar file MUST be able to apply the
23846 'configure; make; make install' paradigm at the top-level directory and
23847 have it work properly.
23848
23849 2. All the pieces of XEmacs must be in the @strong{same} subdirectory on the FTP
23850 site.
23851
23852 3. The names need to be obvious and standard. Naming the core files
23853 "xemacs-21.2.20.tar.gz" is non-standard because those are only the core
23854 files. The standard followed by everybody in the world is that a name like
23855 this refers to the entire product, with all ancillary files. Also, "sumo",
23856 although a nice in-joke, is extremely confusing and needs to go.
23857
23858 Referring to Michael's point about the layout I proposed, I also think that
23859 the package system needs to be modified to accept a layout produced by the
23860 "obvious" way of obtaining and untarring the parts, which leaves you with a
23861 directory consisting of
23862
23863 xemacs-21.2.19/
23864 xemacs-packages/
23865 mule-packages/
23866
23867 All at the same level. However, this is an independent issue from the vote
23868 at hand.
23869
23870
23871 Consider the current insanity. The new XEmacs user or beta tester goes to
23872 the FTP site, looks around, finds the file xemacs-21.2.19.tar.gz, and
23873 downloads it, because it looks like the obvious one to get. But it doesn't
23874 work. Oops ... He looks some more and finds the other two -elc and -info
23875 parts, grabs them, and then tries again. But it still doesn't work. He
23876 manages to overhear something about packages, so he looks for them, but
23877 doesn't find them immediately (they're not even in the beta tree, though
23878 they obviously contain beta-level code, especially in xemacs-base and
23879 mule-base). Eventually he discovers the package/ subdirectory, but what
23880 the hell does he do there? There's no README at all there giving any
23881 clues, so he downloads everything. Along with this, he gets some files
23882 called "sumo", which he doesn't understand, but he notices that some of
23883 them are extremely large. "sumo" ... "large" ... hehe, I get it. Some
23884 silly developer's joke. But then he tries again to compile things, and
23885 just can't figure things out. He still doesn't know:
23886
23887 -- "sumo" is not just some large file, but is a tar file of all the
23888 packages.
23889 -- The packages can't be placed in any subdirectory in any obvious relation
23890 to the XEmacs directory ("straight out of the box" if you manage to grok
23891 the significance of the sumo files, you get a layout like
23892
23893 xemacs-21.2.19/
23894 xemacs-packages/
23895 mule-packages/
23896
23897 which naturally doesn't work! He needs to put them underneath
23898 xemacs-21.2.19/lib/xemacs/ or something.)
23899
23900 At this point, he gives up, and (if he was a user of a pre-packagized
23901 XEmacs) wonders in despair how things got so messed up, when all older
23902 XEmacs releases, including all the betas, followed the standard "configure;
23903 make; make install" paradigm.
23904
23905
23906
23907 Soooooo ......... PLEASE vote on issues #1-3 above, and add any comments
23908 you feel like adding.
23909
23910 ben
23911
23912 Ben Wing wrote:
23913
23914 > Everybody except Steve seems to agree that we need to provide a single
23915 > tar file containing the entire XEmacs tree whenever we release a new
23916 > version of XEmacs (beta or not). Therefore I propose the following
23917 > simple changes, and ask for a vote. If it is the general will of the
23918 > developers, then Steve @strong{WILL} make these changes. This is the
23919 > definition of cooperative development -- no one, not even the
23920 > maintainer, can assert absolute power over anything.
23921 >
23922 > I propose (assuming, for example, release 21.2.20):
23923 >
23924 > 1. xemacs-21.2.20.tar.gz -> xemacs-21.2.20-core.tar.gz
23925 >
23926 > 2. xemacs-sumo.tar.gz -> xemacs-packages.tar.gz
23927 >
23928 > 3. xemacs-mule-sumo.tar.gz -> xemacs-mule-packages.tar.gz
23929 >
23930 > 4. Symlinks to the files mentioned in #2 and #3 get created in the SAME
23931 > directory as xemacs-21.2.20-*.tar.gz.
23932 >
23933 > 5. MOST IMPORTANTLY, a new file xemacs-21.2.20.tar.gz gets created,
23934 > which is the combination of the 5 files xemacs-21.2.20-core.tar.gz,
23935 > xemacs-21.2.20-elc.tar.gz, xemacs-21.2.20-info.tar.gz,
23936 > xemacs-packages.tar.gz, and xemacs-mule-packages.tar.gz.
23937 >
23938 > The directory structure of the new combined file xemacs-21.2.20.tar.gz
23939 > would look like this:
23940 >
23941 > xemacs-21.2.20/
23942 > xemacs-packages/
23943 > xemacs-mule-packages/
23944 >
23945 > I am sorry to shout, but the current situation is just completely
23946 > insane.
23947 >
23948 > ben
23949
23950
23951
23952 From:
23953 Ben Wing <ben@@666.com>
23954 12/6/1999 4:19 AM
23955
23956 Subject:
23957 Re: Please Vote on Proposals
23958 To:
23959 Kyle Jones <kyle_jones@@wonderworks.com>
23960 CC:
23961 XEmacs Review <xemacs-review@@xemacs.org>
23962
23963
23964
23965
23966 OK Kyle, how about a different proposal:
23967
23968 1. The distribution consists of the following three parts (let's assume
23969 v21.2.25):
23970
23971 -- xemacs-21.2.25-core.tar.gz
23972 The same as would currently be in xemacs-21.2.25.tar.gz. You can
23973 run this editor and edit in fundamental mode, but not do anything
23974 else.
23975
23976 -- xemacs-21.2.25-core-packages.tar.gz
23977 A useful and complete subset of all the possible packages. Selection of
23979 what goes in and what goes out is based partially on consensus, partially
23981 on vote, and partially on these criteria:
23982
23983 -- commonly-used packages go in.
23984 -- unmaintained or out-of-date packages go out.
23985 -- buggy, poorly-written packages go out.
23986 -- really obscure packages that hardly anybody could possibly care
23987 about go out.
23988 -- when there are two or three packages implementing basically the
23989 same functionality, pick only one to go in unless there are two that
23991 both are really commonly-used.
23992 -- if a package can be loaded implicitly as a result of something in the
23994 core, it needs to go in, regardless of whether it's been maintained.
23996 This applies, for example, to the mode files -- @strong{all} mode packages must
23998 go in (or more properly, every mode must have a corresponding package
24000 that's in, although if there are two or more packages implementing a
24002 particular mode, e.g. html, we are free to choose just one).
24003
24004 -- xemacs-21.2.25-aux-packages.tar.gz
24005 All of the packages not in the previous file. Generally crappy-quality,
24007 poorly-maintained code.
24008
24009 Note, we do not make distinctions between Mule and non-Mule in our
24010 packaging scheme -- this is a bug and XEmacs and/or the packages should
24011 be fixed up so that this goes away.
24012
24013 2. The distribution also contains two combination files:
24014
24015 -- xemacs-21.2.25.tar.gz
24016 This is the "default" file that a naive user ought to retrieve, and
24017 he'll get a running XEmacs, just like he wants, and comfortable, too,
24018 because all of the common packages are there. This file is a combination
24020 of xemacs-21.2.25-core.tar.gz and xemacs-21.2.25-core-packages.tar.gz.
24021
24022 -- xemacs-21.2.25-everything.tar.gz
24023 This file contains absolutely everything, like it advertises --
24024 including the aux packages and all of their associated crappy-quality,
24026 unmaintained code. This file is a combination of xemacs-21.2.25-core.tar.gz,
24028 xemacs-21.2.25-core-packages.tar.gz, and xemacs-21.2.25-aux-packages.tar.gz.
24030
24031
24032 I like this proposal better than the previous one I advocated, because it
24033 follows your good suggestion of separating the wheat from the chaff in
24034 the packages, so to speak. People will grab xemacs-21.2.25.tar.gz by
24035 default, just like they should,
24036 and they'll get something they're quite happy with, and we're happy
24037 because we can exercise quality control over the packages and exclude the
24038 crappy ones most likely to cause grief later on.
24039
24040
24041 What say y'all?
24042
24043 ben
24044
24045
24046
24047 Kyle Jones wrote:
24048
24049 > Ben Wing writes:
24050 > > Disagree. Please let's follow everyone else's convention, and not
24051 > > introduce yet another randomness.
24052 >
24053 > It is not randomness! I think this is a semantic issue and an
24054 > important one. The issue is: What do we consider part of XEmacs
24055 > and what is considered external to XEmacs. If you put all the
24056 > packages in xemacs.tar.gz, then users can reasonably and wrongly
24057 > assume that all this random Lisp code is maintained by us. We
24058 > are trying to stay away from that model because in the past it has
24059 > left us with piles and piles of orphaned code. Even if every one
24060 > of us were paid to maintain XEmacs, it is just not practical for
24061 > us to continue to maintain all that code, let alone any new code.
24062 > So I think the naming distinction Jan is making is worth doing.
24063 >
24064 > Also, I don't consider the current situation broken, except
24065 > perhaps the sumo tarball being out of date. I never, ever,
24066 > thought it was a great idea to ship all the stuff that XEmacs
24067 > shipped in the old days. Because this pile of code was always
24068 > around in the distribution, an enormous web of undocumented
24069 > dependencies was constructed. Eventually, you HAD to install
24070 > everything because if you left something out or removed something
24071 > you never knew when XEmacs would throw an error. Thus the Cult
24072 > of the Cargo was born.
24073 >
24074 > One of the best things that came out of the package system was
24075 > the month or two we spent running XEmacs without all the assorted
24076 > Lisp installed. Dependencies were removed or documented, some
24077 > stuff got retired, and for the first time we actually had a full
24078 > accounting of what we were shipping. I currently run XEmacs with
24079 > 7 packages and I don't miss the other stuff.
24080 >
24081 > Having come this far, I do not think we should go back to
24082 > advocating that everyone just install everything and not
24083 > think about what they are doing. Besides saving space and startup
24084 > time, another reason to not install everything is that you
24085 > won't bloat your XEmacs process nearly as much if you go
24086 > exploring in the Custom menus, because there won't be as much
24087 > Lisp loaded as Custom sets up its groups and whatnot.
24088
24089 --
24090 In order to save my hands, I am cutting back on my responses, especially
24091 to XEmacs-related mail. You _will_ get a response, but please be
24092 patient. If you need an immediate response and it is not apparent in
24093 your message, please say so. Thanks for your understanding.
20031 @end example 24096 @end example
20032 24097
20033 @node Old Future Work, Index, Future Work Discussion, Top 24098 @node Old Future Work, Index, Future Work Discussion, Top
20034 @chapter Old Future Work 24099 @chapter Old Future Work
20035 @cindex old future work 24100 @cindex old future work
20039 implemented. These proposals are included because they may describe to 24104 implemented. These proposals are included because they may describe to
20040 some extent the actual workings of the implemented code, and because 24105 some extent the actual workings of the implemented code, and because
20041 they may discuss relevant design issues, alternative implementations, or 24106 they may discuss relevant design issues, alternative implementations, or
20042 work still to be done. 24107 work still to be done.
20043 24108
20044
20045 @menu 24109 @menu
20046 * Future Work -- A Portable Unexec Replacement:: 24110 * Old Future Work -- A Portable Unexec Replacement::
20047 * Future Work -- Indirect Buffers:: 24111 * Old Future Work -- Indirect Buffers::
20048 * Future Work -- Improvements in support for non-ASCII (European) keysyms under X:: 24112 * Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X::
20049 * Future Work -- xemacs.org Mailing Address Changes:: 24113 * Old Future Work -- RTF Clipboard Support::
20050 * Future Work -- Lisp callbacks from critical areas of the C code:: 24114 * Old Future Work -- xemacs.org Mailing Address Changes::
24115 * Old Future Work -- Lisp callbacks from critical areas of the C code::
20051 @end menu 24116 @end menu
20052 24117
20053 @node Future Work -- A Portable Unexec Replacement, Future Work -- Indirect Buffers, Old Future Work, Old Future Work 24118 @node Old Future Work -- A Portable Unexec Replacement, Old Future Work -- Indirect Buffers, Old Future Work, Old Future Work
20054 @section Future Work -- A Portable Unexec Replacement 24119 @section Old Future Work -- A Portable Unexec Replacement
20055 @cindex future work, a portable unexec replacement 24120 @cindex old future work, a portable unexec replacement
20056 @cindex a portable unexec replacement, future work 24121 @cindex a portable unexec replacement, old future work
24122
24123 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
20057 24124
20058 @strong{Abstract:} Currently, during the build stage of XEmacs, a bare 24125 @strong{Abstract:} Currently, during the build stage of XEmacs, a bare
20059 version of the program (called @dfn{temacs}) is run, which loads up a 24126 version of the program (called @dfn{temacs}) is run, which loads up a
20060 bunch of Lisp data and then writes out a modified executable file. This 24127 bunch of Lisp data and then writes out a modified executable file. This
20061 process is very tricky to implement and highly system-dependent. It can 24128 process is very tricky to implement and highly system-dependent. It can
20179 preprocessor, or by simply using a different name, such as 24246 preprocessor, or by simply using a different name, such as
20180 @code{xmalloc}. It's also very important that we use the correct 24247 @code{xmalloc}. It's also very important that we use the correct
20181 @code{free} function when freeing dynamically-allocated data, depending 24248 @code{free} function when freeing dynamically-allocated data, depending
20182 on whether this data was allocated by us or by the 24249 on whether this data was allocated by us or by the
20183 24250
20184 @node Future Work -- Indirect Buffers, Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Future Work -- A Portable Unexec Replacement, Old Future Work 24251 @node Old Future Work -- Indirect Buffers, Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work -- A Portable Unexec Replacement, Old Future Work
20185 @section Future Work -- Indirect Buffers 24252 @section Old Future Work -- Indirect Buffers
20186 @cindex future work, indirect buffers 24253 @cindex old future work, indirect buffers
20187 @cindex indirect buffers, future work 24254 @cindex indirect buffers, old future work
24255
24256 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
20188 24257
20189 An indirect buffer is a buffer that shares its text with some other 24258 An indirect buffer is a buffer that shares its text with some other
20190 buffer, but has its own version of all of the buffer properties, 24259 buffer, but has its own version of all of the buffer properties,
20191 including markers, extents, buffer local variables, etc. Indirect 24260 including markers, extents, buffer local variables, etc. Indirect
20192 buffers are not currently implemented in XEmacs, but they are in GNU 24261 buffers are not currently implemented in XEmacs, but they are in GNU
20258 done only once, rather than on each buffer. I imagine it would be 24327 done only once, rather than on each buffer. I imagine it would be
20259 significantly easier to implement this, if a macro were created for 24328 significantly easier to implement this, if a macro were created for
20260 iterating over a buffer, and then all of the indirect children of that 24329 iterating over a buffer, and then all of the indirect children of that
20261 buffer. 24330 buffer.
20262 24331
20263 @node Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Future Work -- xemacs.org Mailing Address Changes, Future Work -- Indirect Buffers, Old Future Work 24332 @node Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work -- RTF Clipboard Support, Old Future Work -- Indirect Buffers, Old Future Work
20264 @section Future Work -- Improvements in support for non-ASCII (European) keysyms under X 24333 @section Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X
20265 @cindex future work, improvements in support for non-ascii (european) keysyms under x 24334 @cindex old future work, improvements in support for non-ascii (european) keysyms under x
20266 @cindex improvements in support for non-ascii (european) keysyms under x, future work 24335 @cindex improvements in support for non-ascii (european) keysyms under x, old future work
20267 24336
20268 From Martin Buchholz. 24337 Author: @uref{mailto:martin@@xemacs.org,Martin Buchholz}
20269 24338
20270 If a user has a keyboard with known standard non-ASCII character 24339 If a user has a keyboard with known standard non-ASCII character
20271 equivalents, typically for European users, then Emacs' default 24340 equivalents, typically for European users, then Emacs' default
20272 binding should be self-insert-command, with the obvious character 24341 binding should be self-insert-command, with the obvious character
20273 inserted. For example, if a user has a keyboard with 24342 inserted. For example, if a user has a keyboard with
20282 even be bound to anything by a user trying to customize it. 24351 even be bound to anything by a user trying to customize it.
20283 24352
20284 This is implemented by maintaining a table of translations between all 24353 This is implemented by maintaining a table of translations between all
20285 the known X keysym names and the corresponding (charset, octet) pairs. 24354 the known X keysym names and the corresponding (charset, octet) pairs.
20286 24355
24356 @quotation
20287 For every key on the keyboard that has a known character correspondence, 24357 For every key on the keyboard that has a known character correspondence,
20288 we define the ascii-character property of the keysym, and make the 24358 we define the ascii-character property of the keysym, and make the
20289 default binding for the key be self-insert-command. 24359 default binding for the key be self-insert-command.
20290 24360
20291 The following magic is basically intimate knowledge of X11/keysymdef.h. 24361 The following magic is basically intimate knowledge of X11/keysymdef.h.
20293 except for Cyrillic and Greek. 24363 except for Cyrillic and Greek.
20294 24364
20295 In a non-Mule world, a user can still have a multi-lingual editor, by doing 24365 In a non-Mule world, a user can still have a multi-lingual editor, by doing
20296 (set-face-font "...-iso8859-2" (current-buffer)) 24366 (set-face-font "...-iso8859-2" (current-buffer))
20297 for all their Latin-2 buffers, etc. 24367 for all their Latin-2 buffers, etc.
20298 24368 @end quotation
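The translation table described above might be sketched as follows. The structure, the function name @code{lookup_keysym}, and the choice of entries are assumptions made for illustration; only the keysym names and octets come from X11/keysymdef.h and the ISO 8859 code charts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a hypothetical keysym-name -> (charset, octet) table. */
struct keysym_translation {
    const char *keysym_name;  /* X keysym name without the XK_ prefix */
    const char *charset;      /* charset name, illustrative spelling */
    unsigned char octet;      /* position of the character in that charset */
};

static const struct keysym_translation keysym_table[] = {
    { "adiaeresis", "latin-iso8859-1", 0xE4 },
    { "eacute",     "latin-iso8859-1", 0xE9 },
    { "aogonek",    "latin-iso8859-2", 0xB1 },
};

/* Return the table entry for NAME, or NULL if the keysym has no known
   character correspondence (for example, a function key). */
static const struct keysym_translation *lookup_keysym(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof keysym_table / sizeof keysym_table[0]; i++) {
        if (strcmp(keysym_table[i].keysym_name, name) == 0)
            return &keysym_table[i];
    }
    return NULL;
}
```

A key whose keysym is found in such a table would get self-insert-command as its default binding; a key that is not found would be left unbound.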
20299 @node Future Work -- xemacs.org Mailing Address Changes, Future Work -- Lisp callbacks from critical areas of the C code, Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work 24369
20300 @section Future Work -- xemacs.org Mailing Address Changes 24370 @node Old Future Work -- RTF Clipboard Support, Old Future Work -- xemacs.org Mailing Address Changes, Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work
20301 @cindex future work, xemacs.org mailing address changes 24371 @section Old Future Work -- RTF Clipboard Support
20302 @cindex xemacs.org mailing address changes, future work 24372 @cindex old future work, RTF clipboard support
24373 @cindex RTF clipboard support, old future work
24374
24375 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
24376
24377 In fact, I merged the Windows stuff with the already-existing generic code.
24378
24379 What I'd like to see is something like this:
24380
24381 @enumerate
24382 @item
24383 The current function
24384
24385 @example
24386 (defun own-selection (data &optional type append)
24387 @end example
24388
24389 should become
24390
24391 @example
24392 (defun own-selection (data &optional type how-to-add data-type)
24393 @end example
24394
24395 where @code{data-type} is the MS Windows format, and @code{how-to-add} is one of
24396
24397 @example
24398 'replace-all or nil -- remove data for all formats
24399 'replace-existing -- remove data for DATA-TYPE, but leave other formats alone
24400 'append or t -- append data to existing data in DATA-TYPE, and leave other
24401 formats alone
24402 @end example
24403
24404 @item
24405 The function
24406
24407 @example
24408 (get-selection &optional TYPE DATA-TYPE)
24409 @end example
24410
24411 already has a @code{data-type} argument, so you don't need to change it.
24412
24413 @item
24414 The existing function
24415
24416 @example
24417 (selection-exists-p &optional SELECTION DEVICE)
24418 @end example
24419
24420 should become
24421
24422 @example
24423 (selection-exists-p &optional SELECTION DEVICE DATA-TYPE)
24424 @end example
24425
24426 @item
24427 A new function
24428
24429 @example
24430 (register-selection-data-type DATA-TYPE)
24431 @end example
24432
24433 like your @code{mswindows-register-clipboard-format}.
24434
24435 @item
24436 There's already a @code{selection-converter-alist}, but that's only for data out.
24437 You should alias it to @code{selection-conversion-out-alist}, and create
24438 @code{selection-conversion-in-alist}. These alists contain entries for CF_TEXT, which
24439 handles CR/LF conversion, and RTF, which does RTF in/out conversion, so there is no need
24440 for separate functions to do this.
24441
24442 This may seem daunting, but it's much easier to add stuff like this than it
24443 seems, and I and others will certainly give you lots of support if you run into
24444 problems. It would be way cool to have a more powerful clipboard mechanism in
24445 XEmacs.
24446 @end enumerate
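The three proposed values of @code{how-to-add} can be illustrated with a small model. This is a sketch in C rather than the Lisp interface itself; the fixed-size store, the enum names, and the C-level @code{own_selection} signature are all assumptions made for the example, not existing XEmacs code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FORMATS 8
#define MAX_DATA 64

/* Mirrors the proposed HOW-TO-ADD values (names are hypothetical). */
enum how_to_add { REPLACE_ALL, REPLACE_EXISTING, APPEND };

struct selection_slot {
    int in_use;
    char data_type[16];   /* e.g. "CF_TEXT" or "RTF" */
    char data[MAX_DATA];
};

struct selection {
    struct selection_slot slots[MAX_FORMATS];
};

/* Find the slot for DATA_TYPE; if absent and CREATE is nonzero, claim a
   free slot for it.  Returns NULL if there is no such slot. */
static struct selection_slot *find_slot(struct selection *sel,
                                        const char *data_type, int create)
{
    struct selection_slot *free_slot = NULL;
    int i;

    for (i = 0; i < MAX_FORMATS; i++) {
        struct selection_slot *s = &sel->slots[i];
        if (s->in_use && strcmp(s->data_type, data_type) == 0)
            return s;
        if (!s->in_use && free_slot == NULL)
            free_slot = s;
    }
    if (create && free_slot != NULL) {
        free_slot->in_use = 1;
        snprintf(free_slot->data_type, sizeof free_slot->data_type,
                 "%s", data_type);
        free_slot->data[0] = '\0';
        return free_slot;
    }
    return NULL;
}

/* Store DATA under DATA_TYPE according to HOW.  Returns 0 on success,
   -1 if the store is full.  Over-long data is silently truncated. */
static int own_selection(struct selection *sel, const char *data,
                         enum how_to_add how, const char *data_type)
{
    struct selection_slot *s;
    size_t used;

    assert(sel != NULL && data != NULL && data_type != NULL);
    if (how == REPLACE_ALL)
        memset(sel, 0, sizeof *sel);     /* drop data for all formats */
    s = find_slot(sel, data_type, 1);
    if (s == NULL)
        return -1;
    if (how != APPEND)
        s->data[0] = '\0';               /* drop old data for this format */
    used = strlen(s->data);
    snprintf(s->data + used, sizeof s->data - used, "%s", data);
    return 0;
}
```

In this model, owning the selection as CF_TEXT and then as RTF with @code{REPLACE_EXISTING} leaves both formats available, whereas a later call with @code{REPLACE_ALL} discards every format except the one being set.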
24447
24448 @node Old Future Work -- xemacs.org Mailing Address Changes, Old Future Work -- Lisp callbacks from critical areas of the C code, Old Future Work -- RTF Clipboard Support, Old Future Work
24449 @section Old Future Work -- xemacs.org Mailing Address Changes
24450 @cindex old future work, xemacs.org mailing address changes
24451 @cindex xemacs.org mailing address changes, old future work
24452
24453 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
20303 24454
20304 @subheading Personal addresses 24455 @subheading Personal addresses
20305 24456
20306 @enumerate 24457 @enumerate
20307 @item 24458 @item
20380 addresses set up will make it much easier for this momentum to be built 24531 addresses set up will make it much easier for this momentum to be built
20381 up and to remain. 24532 up and to remain.
20382 24533
20383 @uref{../../www.666.com/ben/default.htm,Ben Wing} 24534 @uref{../../www.666.com/ben/default.htm,Ben Wing}
20384 24535
20385 @node Future Work -- Lisp callbacks from critical areas of the C code, , Future Work -- xemacs.org Mailing Address Changes, Old Future Work 24536 @node Old Future Work -- Lisp callbacks from critical areas of the C code, , Old Future Work -- xemacs.org Mailing Address Changes, Old Future Work
20386 @section Future Work -- Lisp callbacks from critical areas of the C code 24537 @section Old Future Work -- Lisp callbacks from critical areas of the C code
20387 @cindex future work, lisp callbacks from critical areas of the c code 24538 @cindex old future work, lisp callbacks from critical areas of the c code
20388 @cindex lisp callbacks from critical areas of the c code, future work 24539 @cindex lisp callbacks from critical areas of the c code, old future work
20389 24540
20390 @example 24541 Author: @uref{mailto:ben@@xemacs.org,Ben Wing}
24542
20391 There are many places in the XEmacs C code where Lisp functions are 24543 There are many places in the XEmacs C code where Lisp functions are
20392 called, usually because the Lisp function is acting as a callback, 24544 called, usually because the Lisp function is acting as a callback,
20393 hook, process filter, or the like. The Lisp code is often called in 24545 hook, process filter, or the like. The Lisp code is often called in
20394 places where some Lisp operations are dangerous. Currently there are 24546 places where some Lisp operations are dangerous. Currently there are
20395 a lot of ad-hoc schemes implemented to try to prevent these dangerous 24547 a lot of ad-hoc schemes implemented to try to prevent these dangerous
20421 24573
20422 Corresponding to each of these entries is the C name of the bit flag. 24574 Corresponding to each of these entries is the C name of the bit flag.
20423 24575
20424 The sets of dangerous operations which can be prohibited are: 24576 The sets of dangerous operations which can be prohibited are:
20425 24577
20426 OPERATION_GC_PROHIBITED 24578 @table @code
20427 1. garbage collection. When this flag is set, and the garbage 24579 @item OPERATION_GC_PROHIBITED
20428 collection threshold is reached, garbage collection simply doesn't 24580 garbage collection. When this flag is set, and the garbage
20429 happen. It will happen at the next opportunity that it is allowed. 24581 collection threshold is reached, garbage collection simply doesn't
20430 Similarly, explicitly calling the Lisp function garbage-collect 24582 happen. It will happen at the next opportunity that it is allowed.
20431 simply does nothing. 24583 Similarly, explicitly calling the Lisp function garbage-collect
20432 24584 simply does nothing.
20433 OPERATION_CATCH_ERRORS 24585
20434 2. signalling an error. When @code{enter_sensitive_code_section()} is 24586 @item OPERATION_CATCH_ERRORS
20435 called, with the bit flag corresponding to this prohibited 24587 signalling an error. When @code{enter_sensitive_code_section()} is
20436 operation. When this bit flag is passed to 24588 called, with the bit flag corresponding to this prohibited
20437 @code{enter_sensitive_code_section()}, a catch is set up which catches all 24589 operation. When this bit flag is passed to
20438 errors, signals a warning with @code{warn_when_safe()}, and then simply 24590 @code{enter_sensitive_code_section()}, a catch is set up which catches all
20439 continues. This is exactly the same behavior you now get with the 24591 errors, signals a warning with @code{warn_when_safe()}, and then simply
20440 @code{call_*_trapping_errors()} functions. (there should also be some way 24592 continues. This is exactly the same behavior you now get with the
20441 of specifying a warning level and class here, similar to the 24593 @code{call_*_trapping_errors()} functions. (there should also be some way
20442 @code{call_*_trapping_errors()} functions. This is not completely 24594 of specifying a warning level and class here, similar to the
20443 important, however, because a standard warning level and class 24595 @code{call_*_trapping_errors()} functions. This is not completely
20444 could simply be chosen.) 24596 important, however, because a standard warning level and class
20445 24597 could simply be chosen.)
20446 OPERATION_NO_UNSAFE_OBJECT_DELETION 24598
20447 3. This flag prohibits deletion of any permanent object (i.e. any 24599 @item OPERATION_NO_UNSAFE_OBJECT_DELETION
20448 object that does not automatically disappear when created, such as 24600 This flag prohibits deletion of any permanent object (i.e. any
20449 buffers, frames, devices, windows, etc...) unless they were created 24601 object that does not automatically disappear when created, such as
20450 after this bit flag was set. This would be implemented using a 24602 buffers, frames, devices, windows, etc...) unless they were created
20451 list which stores all of the permanent objects created after this 24603 after this bit flag was set. This would be implemented using a
20452 bit flag was set. This list is reset to its previous value when 24604 list which stores all of the permanent objects created after this
20453 the call to @code{exit_sensitive_code_section()} occurs. The motivation 24605 bit flag was set. This list is reset to its previous value when
20454 here is to allow Lisp callbacks to create their own temporary 24606 the call to @code{exit_sensitive_code_section()} occurs. The motivation
20455 buffers or frames, and later delete them, but not allow any other 24607 here is to allow Lisp callbacks to create their own temporary
20456 permanent objects to be deleted, because C code might be working 24608 buffers or frames, and later delete them, but not allow any other
20457 with them, and not expect them to change. 24609 permanent objects to be deleted, because C code might be working
20458 24610 with them, and not expect them to change.
20459 OPERATION_NO_BUFFER_MODIFICATION 24611
20460 4. This flag disallows modifications to the text, extent or any other 24612 @item OPERATION_NO_BUFFER_MODIFICATION
20461 properties of any buffers except those created after this flag was 24613 This flag disallows modifications to the text, extent or any other
20462 set, just like in the previous entry. 24614 properties of any buffers except those created after this flag was
20463 24615 set, just like in the previous entry.
20464 OPERATION_NO_REDISPLAY 24616
20465 5. This bit flag inhibits any redisplay-related operations from 24617 @item OPERATION_NO_REDISPLAY
20466 happening, more specifically, any entry into the redisplay-related 24618 This bit flag inhibits any redisplay-related operations from
20467 code. This includes, for example, the Lisp functions sit-for, 24619 happening, more specifically, any entry into the redisplay-related
20468 force-redisplay, force-cursor-redisplay, window-end with certain 24620 code. This includes, for example, the Lisp functions sit-for,
20469 arguments to it, and various other functions. When this flag is 24621 force-redisplay, force-cursor-redisplay, window-end with certain
20470 set, instead of entering the redisplay code, the calling function 24622 arguments to it, and various other functions. When this flag is
20471 should simply make sure not to enter the redisplay code, (for 24623 set, instead of entering the redisplay code, the calling function
20472 example, in the case of window-end), or postpone the redisplay 24624 should simply make sure not to enter the redisplay code, (for
20473 until such a time when it's safe (for example, with sit-for and 24625 example, in the case of window-end), or postpone the redisplay
20474 force-redisplay). 24626 until such a time when it's safe (for example, with sit-for and
20475 24627 force-redisplay).
20476 OPERATION_NO_REDISPLAY_SETTINGS_CHANGE 24628
20477 6. This flag prohibits any modifications to faces, glyphs, specifiers, 24629 @item OPERATION_NO_REDISPLAY_SETTINGS_CHANGE
20478 extents, or any other settings that will affect the way that any 24630 This flag prohibits any modifications to faces, glyphs, specifiers,
20479 window is displayed. 24631 extents, or any other settings that will affect the way that any
20480 24632 window is displayed.
24633 @end table
20481 24634
20482 The idea here is that it will finally be safe to call Lisp code from 24635 The idea here is that it will finally be safe to call Lisp code from
20483 nearly any part of the C code, simply by setting any combination of 24636 nearly any part of the C code, simply by setting any combination of
20484 restricted operation bit flags. This even includes from within 24637 restricted operation bit flags. This even includes from within
20485 redisplay. (In such a case, all of the bit flags need to be set.) The 24638 redisplay. (In such a case, all of the bit flags need to be set.) The
20486 reason that I thought of this is that some coding system translations 24639 reason that I thought of this is that some coding system translations
20487 might cause Lisp code to be invoked, and C code often invokes these 24640 might cause Lisp code to be invoked, and C code often invokes these
20488 translations in sensitive places. 24641 translations in sensitive places.
20489 @end example
20490 24642
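The bit-flag scheme proposed in this section could be sketched as follows. The flag names are those listed above, but the bit values, the save-and-restore signatures of @code{enter_sensitive_code_section()} and @code{exit_sensitive_code_section()}, and the helper functions are assumptions made for illustration; the proposal as shown here does not fix them.

```c
#include <assert.h>

/* Flag names from the proposal; the bit positions are arbitrary. */
enum {
    OPERATION_GC_PROHIBITED                = 1 << 0,
    OPERATION_CATCH_ERRORS                 = 1 << 1,
    OPERATION_NO_UNSAFE_OBJECT_DELETION    = 1 << 2,
    OPERATION_NO_BUFFER_MODIFICATION       = 1 << 3,
    OPERATION_NO_REDISPLAY                 = 1 << 4,
    OPERATION_NO_REDISPLAY_SETTINGS_CHANGE = 1 << 5
};

/* Set of operations currently prohibited. */
static int prohibited_operations;

/* Add FLAGS to the prohibited set and return the previous set, so that
   sections can nest (assumed signature). */
static int enter_sensitive_code_section(int flags)
{
    int previous = prohibited_operations;
    prohibited_operations |= flags;
    return previous;
}

/* Restore the set that was in effect before the matching enter call. */
static void exit_sensitive_code_section(int previous)
{
    /* With proper nesting, the saved set is a subset of the current one. */
    assert((previous & ~prohibited_operations) == 0);
    prohibited_operations = previous;
}

/* Hypothetical query used by code that is about to do something risky. */
static int operation_prohibited_p(int flag)
{
    return (prohibited_operations & flag) != 0;
}

/* Stand-in for the garbage collector entry point: returns 1 if a
   collection ran, 0 if it was postponed because GC is prohibited. */
static int maybe_garbage_collect(void)
{
    if (operation_prohibited_p(OPERATION_GC_PROHIBITED))
        return 0;
    return 1;
}
```

C code about to call a Lisp callback would bracket the call with the enter and exit functions, and each dangerous primitive would consult the current flags before proceeding.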
20491 @c Indexing guidelines 24643 @c Indexing guidelines
20492 24644
20493 @c I assume that all indexes will be combined. 24645 @c I assume that all indexes will be combined.
20494 @c Therefore, if a generated findex and permutations 24646 @c Therefore, if a generated findex and permutations