comparison man/internals/internals.texi @ 2365:ce4aa0ef8af1
[xemacs-hg @ 2004-11-04 07:48:14 by ben]
Major work on internals manual. Rearranged many chapters so as to lie in coherent divisions.
Add tons of stuff to Future Work, Old Future Work, Discussions.
Add lots of stuff to Mule section (Multilingual ...).
Remove index.texi, incorporate into internals.texi.
Section on early history and an introduction.
Section on XEmacs split. Lots of new MS Windows docs.
Most recently: Windows-I18N docs. Lots of new I18N docs.
Loads of other stuff.
.
author | ben |
---|---|
date | Thu, 04 Nov 2004 07:48:14 +0000 |
parents | 6aa56b089139 |
children | 2d4dd2ef74e7 |
2364:28dea3be3c6c | 2365:ce4aa0ef8af1 |
---|---|
126 | 126 |
127 @ifinfo | 127 @ifinfo |
128 This Info file contains v21.5 of the XEmacs Internals Manual, October 2004. | 128 This Info file contains v21.5 of the XEmacs Internals Manual, October 2004. |
129 @end ifinfo | 129 @end ifinfo |
130 | 130 |
131 @c Don't update this by hand!!!!!! | 131 @ignore |
132 @c Use C-u C-c C-u m (aka C-u M-x texinfo-master-list). | 132 Don't update this by hand!!!!!! |
133 @c NOTE: This command does not include the Index:: menu entry. | 133 Use C-u C-c C-u m (aka C-u M-x texinfo-master-list). |
134 @c You must add it by hand. | 134 NOTE: This command does not include the Index:: menu entry. |
135 | 135 You must add it by hand. |
136 @c Here are some useful Lisp routines for quickly Texinfo-izing text that | 136 |
137 @c has been formatted into ASCII lists and tables. The first routine is | 137 Here are some useful Lisp routines for quickly Texinfo-izing text that |
138 @c currently more general and well-developed than the second. | 138 has been formatted into ASCII lists and tables. |
139 | 139 |
140 @c (defun list-to-texinfo (b e) | 140 (defun list-to-texinfo (b e) |
141 @c "Convert the selected region from an ASCII list to a Texinfo list." | 141 "Convert the selected region from an ASCII list to a Texinfo list." |
142 @c (interactive "r") | 142 (interactive "r") |
143 @c (save-restriction | 143 (save-restriction |
144 @c (narrow-to-region b e) | 144 (narrow-to-region b e) |
145 @c (goto-char (point-min)) | 145 (goto-char (point-min)) |
146 @c (let ((dash-type "^ *-+ +") | 146 (let ((dash-type "^ *-+ +") |
147 @c (num-type "^ *[[(]?\\([0-9]+\\|[a-z]\\)[]).] +") | 147 ;; allow single-letter numbering or roman numerals |
148 @c dash) | 148 (letter-type "^ *[[(]?\\([a-zA-Z]\\|[IVXivx]+\\)[]).] +") |
149 @c (save-excursion | 149 (num-type "^ *[[(]?[0-9]+[]).] +") |
150 @c (cond ((re-search-forward num-type nil t)) | 150 dash regexp) |
151 @c ((re-search-forward dash-type nil t) (setq dash t)) | 151 (save-excursion |
152 @c (t (error "No table entries?")))) | 152 (re-search-forward "\\s-*") |
153 @c (if dash (insert "@itemize @bullet\n") | 153 (cond ((looking-at dash-type) (setq regexp dash-type dash t)) |
154 @c (insert "@enumerate\n")) | 154 ((looking-at letter-type) (setq regexp letter-type)) |
155 @c (while (re-search-forward (if dash dash-type num-type) nil t) | 155 ((looking-at num-type) (setq regexp num-type)) |
156 @c (let ((p (point))) | 156 ((re-search-forward num-type nil t) (setq regexp num-type)) |
157 @c (or (re-search-forward (if dash dash-type num-type) nil t) | 157 ((re-search-forward letter-type nil t) (setq regexp letter-type)) |
158 @c (goto-char (point-max))) | 158 ((re-search-forward dash-type nil t) |
159 @c (beginning-of-line) | 159 (setq regexp dash-type dash t)) |
160 @c (forward-line -1) | 160 (t (error "No table entries?")))) |
161 @c (let ((q (point))) | 161 (if dash (insert "@itemize @bullet\n") |
162 @c (goto-char p) | 162 (insert "@enumerate\n")) |
163 @c (kill-rectangle p q)) | 163 (re-search-forward regexp nil 'limit) |
164 @c (insert "@item\n"))) | 164 (while (not (eobp)) |
165 @c (goto-char (point-max)) | 165 (delete-region (point-at-bol) (point)) |
166 @c (beginning-of-line) | 166 (insert "@item\n") |
167 @c (if dash (insert "@end itemize\n") | 167 ;; move forward over any text following the dash to not screw |
168 @c (insert "@end enumerate\n"))))) | 168 ;; up remove-spacing. |
169 | 169 (forward-line 1) |
170 @c (defun table-to-texinfo (b e) | 170 (let ((p (point))) |
171 @c "Convert the selected region from an ASCII table to a Texinfo table." | 171 (or (re-search-forward regexp nil t) |
172 @c (interactive "r") | 172 (goto-char (point-max))) |
173 @c (save-restriction | 173 ;; trick to avoid using a marker |
174 @c (narrow-to-region b e) | 174 (save-excursion |
175 @c (goto-char (point-min)) | 175 ;; back up so as not to affect the line we're on (beginning of |
176 @c (insert "@table @code\n") | 176 ;; next entry) |
177 @c (while (not (eobp)) | 177 (forward-line -1) |
178 @c (insert "@item ") | 178 (remove-spacing p (point))))) |
179 @c (forward-sexp) | 179 (beginning-of-line) |
180 @c (delete-char) | 180 (if dash (insert "@end itemize\n") |
181 @c (insert "\n") | 181 (insert "@end enumerate\n"))))) |
182 @c (or (search-forward "\n\n" nil t) | 182 |
183 @c (goto-char (point-max)))) | 183 (defun remove-spacing (b e) |
184 @c (beginning-of-line) | 184 "Remove leading space from the selected region. |
185 @c (insert "@end table\n"))) | 185 This finds the maximum leading blank area common to all lines in the region. |
186 | 186 This includes all lines any part of which are in the region." |
187 @c A useful Lisp routine for adding markup based on conventions used in plain | 187 (interactive "r") |
188 @c text files; see doc string below. | 188 (save-excursion |
189 | 189 (let ((min 999999) |
190 @c (defun convert-text-to-texinfo (&optional no-narrow) | 190 seen) |
191 @c "Convert text to Texinfo. | 191 (goto-char e) |
192 @c If the region is active, do the region; otherwise, go from point to the end | 192 (end-of-line) |
193 @c of the buffer. This query-replaces for various kinds of conventions used | 193 (setq e (point)) |
194 @c in text: @code{} surrounded by ` and ' or followed by a (); @strong{} | 194 (goto-char b) |
195 @c surrounded by *'s; @file{} something that looks like a file name." | 195 (beginning-of-line) |
196 @c (interactive) | 196 (setq b (point)) |
197 @c (if (region-active-p) | 197 (while (< (point) e) |
198 @c (save-restriction | 198 (cond ((looking-at "^\\s-+") |
199 @c (narrow-to-region (region-beginning) (region-end)) | 199 (goto-char (match-end 0)) |
200 @c (convert-comments-to-texinfo t)) | 200 (setq min (min min (current-column)) |
201 @c (let ((p (point)) | 201 seen t)) |
202 @c (case-replace nil)) | 202 ((looking-at "^\\s-*$")) |
203 @c (query-replace-regexp "`\\([^']+\\)'\\([^']\\)" "@code{\\1}\\2" nil) | 203 (t (setq min 0))) |
204 @c (goto-char p) | 204 (forward-line 1)) |
205 @c (query-replace-regexp "\\(\\Sw\\)\\*\\(\\(?:\\s_\\|\\sw\\)+\\)\\*\\([^A-Za-z.}]\\)" "\\1@strong{\\2}\\3" nil) | 205 (when (and seen (> min 0)) |
206 @c (goto-char p) | 206 (goto-char e) |
207 @c (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+()\\)\\([^}]\\)" "@code{\\1}\\3" nil) | 207 (untabify b e) |
208 @c (goto-char p) | 208 ;; we are at end of line already. |
209 @c (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+\\.[A-Za-z]+\\)\\([^A-Za-z.}]\\)" "@file{\\1}\\3" nil) | 209 (if (not (= (point) (point-at-eol))) |
210 @c ))) | 210 (error "Logic error")) |
211 ;; Pad line with spaces if necessary (it may be just a blank line) | |
212 (if (< (current-column) min) | |
213 (insert-char ?\ (- min (current-column))) | |
214 (beginning-of-line) | |
215 (forward-char min)) | |
216 (kill-rectangle b (point)))))) | |
217 | |
218 (defun table-to-texinfo (b e) | |
219 "Convert the selected region from an ASCII table to a Texinfo table. | |
220 Assumes entries are separated by a blank line, and the first sexp in | |
221 each entry is the table heading." | |
222 (interactive "r") | |
223 (save-restriction | |
224 (narrow-to-region b e) | |
225 (goto-char (point-min)) | |
226 (insert "@table @code\n") | |
227 (while (not (eobp)) | |
228 ;; remember where we want to insert the @item. | |
229 ;; delete the spacing first since inserting the @item may create | |
230 ;; a line with no spacing, if there is text following the heading on | |
231 ;; the same line. | |
232 (let ((beg (point))) | |
233 ;; removing the space and inserting the @item will change the | |
234 ;; position of the end of the region, so to make it easy on us | |
235 ;; leave point at end so it will be adjusted. | |
236 (forward-line 1) | |
237 (let ((beg2 (point))) | |
238 (or (re-search-forward "^$" nil t) | |
239 (goto-char (point-max))) | |
240 (backward-char 1) | |
241 (remove-spacing beg2 (point))) | |
242 (ignore-errors (forward-char 2)) | |
243 (save-excursion | |
244 (goto-char beg) | |
245 (insert "@item ") | |
246 (forward-sexp) | |
247 (delete-char) | |
248 (insert "\n")))) | |
249 (beginning-of-line) | |
250 (insert "@end table\n"))) | |
251 | |
252 A useful Lisp routine for adding markup based on conventions used in plain | |
253 text files; see doc string below. | |
254 | |
255 (defun convert-text-to-texinfo (&optional no-narrow) | |
256 "Convert text to Texinfo. | |
257 If the region is active, do the region; otherwise, go from point to the end | |
258 of the buffer. This query-replaces for various kinds of conventions used | |
259 in text: @code{} surrounded by ` and ' or followed by a (); @strong{} | |
260 surrounded by *'s; @file{} something that looks like a file name." | |
261 (interactive) | |
262 (if (region-active-p) | |
263 (save-restriction | |
264 (narrow-to-region (region-beginning) (region-end)) | |
265 (convert-comments-to-texinfo t)) | |
266 (let ((p (point)) | |
267 (case-replace nil)) | |
268 (query-replace-regexp "`\\([^']+\\)'\\([^']\\)" "@code{\\1}\\2" nil) | |
269 (goto-char p) | |
270 (query-replace-regexp "\\(\\Sw\\)\\*\\(\\(?:\\s_\\|\\sw\\)+\\)\\*\\([^A-Za-z.}]\\)" "\\1@strong{\\2}\\3" nil) | |
271 (goto-char p) | |
272 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+()\\)\\([^}]\\)" "@code{\\1}\\3" nil) | |
273 (goto-char p) | |
274 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+\\.[A-Za-z]+\\)\\([^A-Za-z.}]\\)" "@file{\\1}\\3" nil) | |
275 ))) | |
276 | |
277 Macro to generate the "Future Work" section from a title; put | |
278 point at beginning. | |
279 | |
280 (defalias 'make-future (read-kbd-macro | |
281 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Future SPC Work SPC - - SPC <home> <down> <C-right> <right> Future SPC Work SPC - - SPC <end> RET @cindex SPC future SPC work, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC future SPC work RET")) | |
282 | |
283 Similar but generates a "Discussion" section. | |
284 | |
285 (defalias 'make-discussion (read-kbd-macro | |
286 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Discussion SPC - - SPC <home> <down> <C-right> <right> Discussion SPC - - SPC <end> RET @cindex SPC discussion, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC discussion RET")) | |
287 | |
288 Similar but generates an "Old Future Work" section. | |
289 | |
290 (defalias 'make-old-future (read-kbd-macro | |
291 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Old SPC Future SPC Work SPC - - SPC <home> <down> <C-right> <right> Old SPC Future SPC Work SPC - - SPC <end> RET @cindex SPC old SPC future SPC work, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC old SPC future SPC work RET")) | |
292 | |
293 Similar but generates a general section. | |
294 | |
295 (defalias 'make-section (read-kbd-macro | |
296 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> RET @cindex SPC C-SPC C-g <f4> C-x C-x M-l <home> <down>")) | |
297 | |
298 Similar but generates a general subsection. | |
299 | |
300 (defalias 'make-subsection (read-kbd-macro | |
301 "<S-end> <f3> <home> @node SPC <end> RET @subsection SPC <f4> RET @cindex SPC C-SPC C-g <f4> C-x C-x M-l <home> <down>")) | |
302 @end ignore | |
211 | 303 |
212 @menu | 304 @menu |
213 * Introduction:: Overview of this manual. | 305 * Introduction:: Overview of this manual. |
214 * Authorship of XEmacs:: | 306 * Authorship of XEmacs:: |
215 * A History of Emacs:: Times, dates, important events. | 307 * A History of Emacs:: Times, dates, important events. |
216 * XEmacs From the Outside:: A broad conceptual overview. | 308 * The XEmacs Split:: |
309 * XEmacs from the Outside:: A broad conceptual overview. | |
217 * The Lisp Language:: An overview. | 310 * The Lisp Language:: An overview. |
218 * XEmacs From the Perspective of Building:: | 311 * XEmacs from the Perspective of Building:: |
219 * Build-Time Dependencies:: | 312 * Build-Time Dependencies:: |
220 * XEmacs From the Inside:: | 313 * The Modules of XEmacs:: |
221 * The XEmacs Object System (Abstractly Speaking):: | |
222 * How Lisp Objects Are Represented in C:: | |
223 * Major Textual Changes:: | 314 * Major Textual Changes:: |
224 * Rules When Writing New C Code:: | 315 * Rules When Writing New C Code:: |
225 * Regression Testing XEmacs:: | 316 * Regression Testing XEmacs:: |
226 * CVS Techniques:: | 317 * CVS Techniques:: |
227 * The Modules of XEmacs:: | 318 * XEmacs from the Inside:: |
319 * The XEmacs Object System (Abstractly Speaking):: | |
320 * How Lisp Objects Are Represented in C:: | |
228 * Allocation of Objects in XEmacs Lisp:: | 321 * Allocation of Objects in XEmacs Lisp:: |
229 * Dumping:: | 322 * The Lisp Reader and Compiler:: |
230 * Events and the Event Loop:: | |
231 * Asynchronous Events; Quit Checking:: | |
232 * Evaluation; Stack Frames; Bindings:: | 323 * Evaluation; Stack Frames; Bindings:: |
233 * Symbols and Variables:: | 324 * Symbols and Variables:: |
234 * Buffers:: | 325 * Buffers:: |
235 * Text:: | 326 * Text:: |
236 * Multilingual Support:: | 327 * Multilingual Support:: |
237 * The Lisp Reader and Compiler:: | |
238 * Lstreams:: | |
239 * Consoles; Devices; Frames; Windows:: | 328 * Consoles; Devices; Frames; Windows:: |
240 * The Redisplay Mechanism:: | 329 * The Redisplay Mechanism:: |
241 * Extents:: | 330 * Extents:: |
242 * Faces:: | 331 * Faces:: |
243 * Glyphs:: | 332 * Glyphs:: |
244 * Specifiers:: | 333 * Specifiers:: |
245 * Menus:: | 334 * Menus:: |
335 * Events and the Event Loop:: | |
336 * Asynchronous Events; Quit Checking:: | |
337 * Lstreams:: | |
246 * Subprocesses:: | 338 * Subprocesses:: |
247 * Interface to MS Windows:: | 339 * Interface to MS Windows:: |
248 * Interface to the X Window System:: | 340 * Interface to the X Window System:: |
341 * Dumping:: | |
249 * Future Work:: | 342 * Future Work:: |
250 * Future Work Discussion:: | 343 * Future Work Discussion:: |
251 * Old Future Work:: | 344 * Old Future Work:: |
252 * Index:: | 345 * Index:: |
253 | 346 |
259 * Through Version 18:: Unification prevails. | 352 * Through Version 18:: Unification prevails. |
260 * Lucid Emacs:: One version 19 Emacs. | 353 * Lucid Emacs:: One version 19 Emacs. |
261 * GNU Emacs 19:: The other version 19 Emacs. | 354 * GNU Emacs 19:: The other version 19 Emacs. |
262 * GNU Emacs 20:: The other version 20 Emacs. | 355 * GNU Emacs 20:: The other version 20 Emacs. |
263 * XEmacs:: The continuation of Lucid Emacs. | 356 * XEmacs:: The continuation of Lucid Emacs. |
357 | |
358 The Modules of XEmacs | |
359 | |
360 * A Summary of the Various XEmacs Modules:: | |
361 * Low-Level Modules:: | |
362 * Basic Lisp Modules:: | |
363 * Modules for Standard Editing Operations:: | |
364 * Modules for Interfacing with the File System:: | |
365 * Modules for Other Aspects of the Lisp Interpreter and Object System:: | |
366 * Modules for Interfacing with the Operating System:: | |
264 | 367 |
265 Major Textual Changes | 368 Major Textual Changes |
266 | 369 |
267 * Great Integral Type Renaming:: | 370 * Great Integral Type Renaming:: |
268 * Text/Char Type Renaming:: | 371 * Text/Char Type Renaming:: |
285 * Modules for Regression Testing:: | 388 * Modules for Regression Testing:: |
286 | 389 |
287 CVS Techniques | 390 CVS Techniques |
288 | 391 |
289 * Merging a Branch into the Trunk:: | 392 * Merging a Branch into the Trunk:: |
290 | |
291 The Modules of XEmacs | |
292 | |
293 * A Summary of the Various XEmacs Modules:: | |
294 * Low-Level Modules:: | |
295 * Basic Lisp Modules:: | |
296 * Modules for Standard Editing Operations:: | |
297 * Modules for Interfacing with the File System:: | |
298 * Modules for Other Aspects of the Lisp Interpreter and Object System:: | |
299 * Modules for Interfacing with the Operating System:: | |
300 | 393 |
301 Allocation of Objects in XEmacs Lisp | 394 Allocation of Objects in XEmacs Lisp |
302 | 395 |
303 * Introduction to Allocation:: | 396 * Introduction to Allocation:: |
304 * Garbage Collection:: | 397 * Garbage Collection:: |
325 * sweep_lcrecords_1:: | 418 * sweep_lcrecords_1:: |
326 * compact_string_chars:: | 419 * compact_string_chars:: |
327 * sweep_strings:: | 420 * sweep_strings:: |
328 * sweep_bit_vectors_1:: | 421 * sweep_bit_vectors_1:: |
329 | 422 |
330 Dumping | 423 Evaluation; Stack Frames; Bindings |
331 | 424 |
332 * Dumping Justification:: | 425 * Evaluation:: |
333 * Overview:: | 426 * Dynamic Binding; The specbinding Stack; Unwind-Protects:: |
334 * Data descriptions:: | 427 * Simple Special Forms:: |
335 * Dumping phase:: | 428 * Catch and Throw:: |
336 * Reloading phase:: | 429 * Error Trapping:: |
337 * Remaining issues:: | 430 |
338 | 431 Symbols and Variables |
339 Dumping phase | 432 |
340 | 433 * Introduction to Symbols:: |
341 * Object inventory:: | 434 * Obarrays:: |
342 * Address allocation:: | 435 * Symbol Values:: |
343 * The header:: | 436 |
344 * Data dumping:: | 437 Buffers |
345 * Pointers dumping:: | 438 |
439 * Introduction to Buffers:: A buffer holds a block of text such as a file. | |
440 * Buffer Lists:: Keeping track of all buffers. | |
441 * Markers and Extents:: Tagging locations within a buffer. | |
442 * The Buffer Object:: The Lisp object corresponding to a buffer. | |
443 | |
444 Text | |
445 | |
446 * The Text in a Buffer:: Representation of the text in a buffer. | |
447 * Ibytes and Ichars:: Representation of individual characters. | |
448 * Byte-Char Position Conversion:: | |
449 * Searching and Matching:: Higher-level algorithms. | |
450 | |
451 Multilingual Support | |
452 | |
453 * Introduction to Multilingual Issues #1:: | |
454 * Introduction to Multilingual Issues #2:: | |
455 * Introduction to Multilingual Issues #3:: | |
456 * Introduction to Multilingual Issues #4:: | |
457 * Character Sets:: | |
458 * Encodings:: | |
459 * Internal Mule Encodings:: | |
460 * Byte/Character Types; Buffer Positions; Other Typedefs:: | |
461 * Internal Text API's:: | |
462 * Coding for Mule:: | |
463 * CCL:: | |
464 * Microsoft Windows-Related Multilingual Issues:: | |
465 * Modules for Internationalization:: | |
466 | |
467 Encodings | |
468 | |
469 * Japanese EUC (Extended Unix Code):: | |
470 * JIS7:: | |
471 | |
472 Internal Mule Encodings | |
473 | |
474 * Internal String Encoding:: | |
475 * Internal Character Encoding:: | |
476 | |
477 Byte/Character Types; Buffer Positions; Other Typedefs | |
478 | |
479 * Byte Types:: | |
480 * Different Ways of Seeing Internal Text:: | |
481 * Buffer Positions:: | |
482 * Other Typedefs:: | |
483 * Usage of the Various Representations:: | |
484 * Working With the Various Representations:: | |
485 | |
486 Internal Text API's | |
487 | |
488 * Basic internal-format API's:: | |
489 * The DFC API:: | |
490 * The Eistring API:: | |
491 | |
492 Coding for Mule | |
493 | |
494 * Character-Related Data Types:: | |
495 * Working With Character and Byte Positions:: | |
496 * Conversion to and from External Data:: | |
497 * General Guidelines for Writing Mule-Aware Code:: | |
498 * An Example of Mule-Aware Code:: | |
499 * Mule-izing Code:: | |
500 | |
501 Microsoft Windows-Related Multilingual Issues | |
502 | |
503 * Microsoft Documentation:: | |
504 * Locales:: | |
505 * More about code pages:: | |
506 * More about locales:: | |
507 * Unicode support under Windows:: | |
508 * The golden rules of writing Unicode-safe code:: | |
509 * The format of the locale in setlocale():: | |
510 * Random other Windows I18N docs:: | |
511 | |
512 Consoles; Devices; Frames; Windows | |
513 | |
514 * Introduction to Consoles; Devices; Frames; Windows:: | |
515 * Point:: | |
516 * Window Hierarchy:: | |
517 * The Window Object:: | |
518 * Modules for the Basic Displayable Lisp Objects:: | |
519 | |
520 The Redisplay Mechanism | |
521 | |
522 * Critical Redisplay Sections:: | |
523 * Line Start Cache:: | |
524 * Redisplay Piece by Piece:: | |
525 * Modules for the Redisplay Mechanism:: | |
526 * Modules for other Display-Related Lisp Objects:: | |
527 | |
528 Extents | |
529 | |
530 * Introduction to Extents:: Extents are ranges over text, with properties. | |
531 * Extent Ordering:: How extents are ordered internally. | |
532 * Format of the Extent Info:: The extent information in a buffer or string. | |
533 * Zero-Length Extents:: A weird special case. | |
534 * Mathematics of Extent Ordering:: A rigorous foundation. | |
535 * Extent Fragments:: Cached information useful for redisplay. | |
346 | 536 |
347 Events and the Event Loop | 537 Events and the Event Loop |
348 | 538 |
349 * Introduction to Events:: | 539 * Introduction to Events:: |
350 * Main Loop:: | 540 * Main Loop:: |
365 * Control-G (Quit) Checking:: | 555 * Control-G (Quit) Checking:: |
366 * Profiling:: | 556 * Profiling:: |
367 * Asynchronous Timeouts:: | 557 * Asynchronous Timeouts:: |
368 * Exiting:: | 558 * Exiting:: |
369 | 559 |
370 Evaluation; Stack Frames; Bindings | |
371 | |
372 * Evaluation:: | |
373 * Dynamic Binding; The specbinding Stack; Unwind-Protects:: | |
374 * Simple Special Forms:: | |
375 * Catch and Throw:: | |
376 | |
377 Symbols and Variables | |
378 | |
379 * Introduction to Symbols:: | |
380 * Obarrays:: | |
381 * Symbol Values:: | |
382 | |
383 Buffers | |
384 | |
385 * Introduction to Buffers:: A buffer holds a block of text such as a file. | |
386 * Buffer Lists:: Keeping track of all buffers. | |
387 * Markers and Extents:: Tagging locations within a buffer. | |
388 * The Buffer Object:: The Lisp object corresponding to a buffer. | |
389 | |
390 Text | |
391 | |
392 * The Text in a Buffer:: Representation of the text in a buffer. | |
393 * Ibytes and Ichars:: Representation of individual characters. | |
394 * Byte-Char Position Conversion:: | |
395 * Searching and Matching:: Higher-level algorithms. | |
396 | |
397 Multilingual Support | |
398 | |
399 * Introduction to Multilingual Issues #1:: | |
400 * Introduction to Multilingual Issues #2:: | |
401 * Introduction to Multilingual Issues #3:: | |
402 * Introduction to Multilingual Issues #4:: | |
403 * Character Sets:: | |
404 * Encodings:: | |
405 * Internal Mule Encodings:: | |
406 * Byte/Character Types; Buffer Positions; Other Typedefs:: | |
407 * Internal Text API's:: | |
408 * Coding for Mule:: | |
409 * CCL:: | |
410 * Modules for Internationalization:: | |
411 | |
412 Encodings | |
413 | |
414 * Japanese EUC (Extended Unix Code):: | |
415 * JIS7:: | |
416 | |
417 Internal Mule Encodings | |
418 | |
419 * Internal String Encoding:: | |
420 * Internal Character Encoding:: | |
421 | |
422 Byte/Character Types; Buffer Positions; Other Typedefs | |
423 | |
424 * Byte Types:: | |
425 * Different Ways of Seeing Internal Text:: | |
426 * Buffer Positions:: | |
427 * Other Typedefs:: | |
428 * Usage of the Various Representations:: | |
429 * Working With the Various Representations:: | |
430 | |
431 Internal Text API's | |
432 | |
433 * Basic internal-format API's:: | |
434 * The DFC API:: | |
435 * The Eistring API:: | |
436 | |
437 Coding for Mule | |
438 | |
439 * Character-Related Data Types:: | |
440 * Working With Character and Byte Positions:: | |
441 * Conversion to and from External Data:: | |
442 * General Guidelines for Writing Mule-Aware Code:: | |
443 * An Example of Mule-Aware Code:: | |
444 * Mule-izing Code:: | |
445 | |
446 Lstreams | 560 Lstreams |
447 | 561 |
448 * Creating an Lstream:: Creating an lstream object. | 562 * Creating an Lstream:: Creating an lstream object. |
449 * Lstream Types:: Different sorts of things that are streamed. | 563 * Lstream Types:: Different sorts of things that are streamed. |
450 * Lstream Functions:: Functions for working with lstreams. | 564 * Lstream Functions:: Functions for working with lstreams. |
451 * Lstream Methods:: Creating new lstream types. | 565 * Lstream Methods:: Creating new lstream types. |
452 | |
453 Consoles; Devices; Frames; Windows | |
454 | |
455 * Introduction to Consoles; Devices; Frames; Windows:: | |
456 * Point:: | |
457 * Window Hierarchy:: | |
458 * The Window Object:: | |
459 * Modules for the Basic Displayable Lisp Objects:: | |
460 | |
461 The Redisplay Mechanism | |
462 | |
463 * Critical Redisplay Sections:: | |
464 * Line Start Cache:: | |
465 * Redisplay Piece by Piece:: | |
466 * Modules for the Redisplay Mechanism:: | |
467 * Modules for other Display-Related Lisp Objects:: | |
468 | |
469 Extents | |
470 | |
471 * Introduction to Extents:: Extents are ranges over text, with properties. | |
472 * Extent Ordering:: How extents are ordered internally. | |
473 * Format of the Extent Info:: The extent information in a buffer or string. | |
474 * Zero-Length Extents:: A weird special case. | |
475 * Mathematics of Extent Ordering:: A rigorous foundation. | |
476 * Extent Fragments:: Cached information useful for redisplay. | |
477 | 566 |
478 Interface to MS Windows | 567 Interface to MS Windows |
479 | 568 |
480 * Different kinds of Windows environments:: | 569 * Different kinds of Windows environments:: |
481 * Windows Build Flags:: | 570 * Windows Build Flags:: |
494 * Menubars:: | 583 * Menubars:: |
495 * Checkboxes and Radio Buttons:: | 584 * Checkboxes and Radio Buttons:: |
496 * Progress Bars:: | 585 * Progress Bars:: |
497 * Tab Controls:: | 586 * Tab Controls:: |
498 | 587 |
588 Dumping | |
589 | |
590 * Dumping Justification:: | |
591 * Overview:: | |
592 * Data descriptions:: | |
593 * Dumping phase:: | |
594 * Reloading phase:: | |
595 * Remaining issues:: | |
596 | |
597 Dumping phase | |
598 | |
599 * Object inventory:: | |
600 * Address allocation:: | |
601 * The header:: | |
602 * Data dumping:: | |
603 * Pointers dumping:: | |
604 | |
499 Future Work | 605 Future Work |
500 | 606 |
607 * Future Work -- General Suggestions:: | |
501 * Future Work -- Elisp Compatibility Package:: | 608 * Future Work -- Elisp Compatibility Package:: |
502 * Future Work -- Drag-n-Drop:: | 609 * Future Work -- Drag-n-Drop:: |
503 * Future Work -- Standard Interface for Enabling Extensions:: | 610 * Future Work -- Standard Interface for Enabling Extensions:: |
504 * Future Work -- Better Initialization File Scheme:: | 611 * Future Work -- Better Initialization File Scheme:: |
505 * Future Work -- Keyword Parameters:: | 612 * Future Work -- Keyword Parameters:: |
543 | 650 |
544 Future Work -- Byte Code Snippets | 651 Future Work -- Byte Code Snippets |
545 | 652 |
546 * Future Work -- Autodetection:: | 653 * Future Work -- Autodetection:: |
547 * Future Work -- Conversion Error Detection:: | 654 * Future Work -- Conversion Error Detection:: |
655 * Future Work -- Unicode:: | |
548 * Future Work -- BIDI Support:: | 656 * Future Work -- BIDI Support:: |
549 * Future Work -- Localized Text/Messages:: | 657 * Future Work -- Localized Text/Messages:: |
550 | 658 |
551 Future Work -- Lisp Engine Replacement | 659 Future Work -- Lisp Engine Replacement |
552 | 660 |
553 * Future Work -- Lisp Engine Discussion:: | 661 * Future Work -- Lisp Engine Discussion:: |
554 * Future Work -- Lisp Engine Replacement -- Implementation:: | 662 * Future Work -- Lisp Engine Replacement -- Implementation:: |
663 * Future Work -- Startup File Modification by Packages:: | |
555 | 664 |
556 Future Work Discussion | 665 Future Work Discussion |
557 | 666 |
558 * Discussion -- garbage collection:: | 667 * Discussion -- garbage collection:: |
559 * Discussion -- glyphs:: | 668 * Discussion -- glyphs:: |
669 * Discussion -- Dialog Boxes:: | |
670 * Discussion -- Multilingual Issues:: | |
671 * Discussion -- Windows External Widget:: | |
672 * Discussion -- Packages:: | |
673 * Discussion -- Distribution Layout:: | |
560 | 674 |
561 Old Future Work | 675 Old Future Work |
562 | 676 |
563 * Future Work -- A Portable Unexec Replacement:: | 677 * Old Future Work -- A Portable Unexec Replacement:: |
564 * Future Work -- Indirect Buffers:: | 678 * Old Future Work -- Indirect Buffers:: |
565 * Future Work -- Improvements in support for non-ASCII (European) keysyms under X:: | 679 * Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X:: |
566 * Future Work -- xemacs.org Mailing Address Changes:: | 680 * Old Future Work -- RTF Clipboard Support:: |
567 * Future Work -- Lisp callbacks from critical areas of the C code:: | 681 * Old Future Work -- xemacs.org Mailing Address Changes:: |
682 * Old Future Work -- Lisp callbacks from critical areas of the C code:: | |
568 | 683 |
569 @end detailmenu | 684 @end detailmenu |
570 @end menu | 685 @end menu |
571 | 686 |
572 @node Introduction, Authorship of XEmacs, Top, Top | 687 @node Introduction, Authorship of XEmacs, Top, Top |
595 the snapshot of the code you are looking at, and in the case of | 710 the snapshot of the code you are looking at, and in the case of |
596 contradictions between the code comments and the manual, @strong{always} | 711 contradictions between the code comments and the manual, @strong{always} |
597 assume that the code comments are correct. (Because of the proximity of | 712 assume that the code comments are correct. (Because of the proximity of |
598 the comments to the code, comments will rarely be out-of-date.) | 713 the comments to the code, comments will rarely be out-of-date.) |
599 | 714 |
715 The manual is organized in chapters which are broadly grouped into major | |
716 divisions: | |
717 | |
718 @enumerate | |
719 @item | |
720 First is the introduction, including this chapter and chapters on the | |
721 history and authorship of XEmacs. | |
722 @item | |
723 Next, starting with @ref{XEmacs from the Outside}, are some chapters | |
724 giving a broad overview of the internal workings of XEmacs and | |
725 documenting important information relevant to those working on the code. | |
726 @item | |
727 The remaining divisions document the nitty-gritty details of the | |
728 internal workings. First, starting with @ref{XEmacs from the Inside}, | |
729 is a division on the workings of the Lisp interpreter that drives | |
730 XEmacs. | |
731 @item | |
732 Next, starting with @ref{Buffers}, is a division on the parts of the | |
733 code specifically devoted to text processing, including multilingual | |
734 support (Mule). | |
735 @item | |
736 Afterwards, starting with @ref{Consoles; Devices; Frames; Windows}, is a | |
737 division covering the display mechanism and the objects and modules | |
738 relevant to this. | |
739 @item | |
740 Then, starting with @ref{Events and the Event Loop}, is a division | |
741 covering the interface between XEmacs and the outside world, including | |
742 user interactions, subprocesses, file I/O, interfaces to particular | |
743 windowing systems, and dumping. | |
744 @item | |
745 Finally, starting with @ref{Future Work}, is a division containing | |
746 proposals and discussion relating to future work on XEmacs. | |
747 @end enumerate | |
748 | |
600 This manual was primarily written by Ben Wing. Certain sections were | 749 This manual was primarily written by Ben Wing. Certain sections were |
601 written by others, including those mentioned on the title page as well | 750 written by others, including those mentioned on the title page as well |
602 as other coders. Some sections were lifted directly from comments in | 751 as other coders. Some sections were lifted directly from comments in |
603 the code, and in those cases we may not completely be aware of the | 752 the code, and in those cases we may not completely be aware of the |
604 authorship. In addition, due to the collaborative nature of XEmacs, | 753 authorship. In addition, due to the collaborative nature of XEmacs, |
613 @table @asis | 762 @table @asis |
614 @item Stephen Turnbull | 763 @item Stephen Turnbull |
615 Various cleanup work, mostly post-2000. Object-Oriented Techniques in | 764 Various cleanup work, mostly post-2000. Object-Oriented Techniques in |
616 XEmacs. A Reader's Guide to XEmacs Coding Conventions. Searching and | 765 XEmacs. A Reader's Guide to XEmacs Coding Conventions. Searching and |
617 Matching. Regression Testing XEmacs. Modules for Regression Testing. | 766 Matching. Regression Testing XEmacs. Modules for Regression Testing. |
618 Lucid Widget Library. | 767 Lucid Widget Library. A number of sections in the Future Work chapter. |
619 @item Martin Buchholz | 768 @item Martin Buchholz |
620 Various cleanup work, mostly pre-2001. Docs on inline functions. Docs | 769 Various cleanup work, mostly pre-2001. Docs on inline functions. Docs |
621 on dfc conversion functions (Conversion to and from External Data). | 770 on dfc conversion functions (Conversion to and from External Data). |
622 Improvements in support for non-ASCII (European) keysyms under X. | 771 Improvements in support for non-ASCII (European) keysyms under X. |
772 A section or two in the Future Work chapter. | |
623 @item Hrvoje Niksic | 773 @item Hrvoje Niksic |
624 Coding for Mule. | 774 Coding for Mule. |
625 @item Matthias Neubauer | 775 @item Matthias Neubauer |
626 Garbage Collection - Step by Step. | 776 Garbage Collection - Step by Step. |
627 @item Olivier Galibert | 777 @item Olivier Galibert |
630 Redisplay Piece by Piece. Glyphs. | 780 Redisplay Piece by Piece. Glyphs. |
631 @item Chuck Thompson | 781 @item Chuck Thompson |
632 Line Start Cache. | 782 Line Start Cache. |
633 @item Kenichi Handa | 783 @item Kenichi Handa |
634 CCL. | 784 CCL. |
785 @item Jamie Zawinski | |
786 A couple of sections in the Future Work chapter. | |
635 @end table | 787 @end table |
636 | 788 |
637 @node Authorship of XEmacs, A History of Emacs, Introduction, Top | 789 @node Authorship of XEmacs, A History of Emacs, Introduction, Top |
638 @chapter Authorship of XEmacs | 790 @chapter Authorship of XEmacs |
639 @cindex authorship, XEmacs | 791 @cindex authorship, XEmacs |
772 @item alloca.s | 924 @item alloca.s |
773 Inherited almost unchanged from FSF; kept in sync up through 19.30; | 925 Inherited almost unchanged from FSF; kept in sync up through 19.30; |
774 basically no changes for XEmacs. | 926 basically no changes for XEmacs. |
775 @end table | 927 @end table |
776 | 928 |
777 @node A History of Emacs, XEmacs From the Outside, Authorship of XEmacs, Top | 929 @node A History of Emacs, The XEmacs Split, Authorship of XEmacs, Top |
778 @chapter A History of Emacs | 930 @chapter A History of Emacs |
779 @cindex history of Emacs, a | 931 @cindex history of Emacs, a |
780 @cindex Emacs, a history of | 932 @cindex Emacs, a history of |
781 @cindex Hackers (Steven Levy) | 933 @cindex Hackers (Steven Levy) |
782 @cindex Levy, Steven | 934 @cindex Levy, Steven |
1342 version 21.2.45 released February 23, 2001. | 1494 version 21.2.45 released February 23, 2001. |
1343 @item | 1495 @item |
1344 version 21.2.46 released March 21, 2001. | 1496 version 21.2.46 released March 21, 2001. |
1345 @end itemize | 1497 @end itemize |
1346 | 1498 |
1347 @node XEmacs From the Outside, The Lisp Language, A History of Emacs, Top | 1499 @node The XEmacs Split, XEmacs from the Outside, A History of Emacs, Top |
1348 @chapter XEmacs From the Outside | 1500 @chapter The XEmacs Split |
1501 @cindex XEmacs split | |
1502 | |
1503 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
1504 | |
1505 @strong{NOTE NOTE NOTE}: The following is a @strong{highly} opinionated | |
1506 piece written by one of the main authors of XEmacs. This reflects his | |
1507 opinions, and his only! It is included here because it may help to | |
1508 clarify some of the issues that are keeping the two versions of Emacs | |
1509 separate. | |
1510 | |
1511 Many people look at the split between GNU Emacs and XEmacs and are | |
1512 convinced that the XEmacs team is being needlessly divisive and just needs | |
1513 to cooperate a bit with RMS, and the two versions of Emacs will merge. In | |
1514 fact there have been six to seven major attempts at merging, each running | |
1515 hundreds of messages long and all of them coming from the XEmacs side. All | |
1516 have failed because they have eventually come to the same conclusion, which | |
1517 is that RMS has no real interest in cooperation at all. If you work with | |
1518 him, you have to do it his way -- "my way or the highway". Specifically: | |
1519 | |
1520 @enumerate | |
1521 @item | |
1522 | |
1523 RMS insists on having legal papers signed for every bit of code that goes | |
1524 into GNU Emacs. RMS's lawyers have told him that every contribution over | |
1525 ten lines long requires legal papers. These papers cannot be filled out | |
1526 over the web but must be done in person and mailed to the FSF. | |
1527 Obviously this by itself has a tendency to inhibit contributions because of | |
1528 the hassle factor. Furthermore, many people (and especially organizations) | |
1529 are either hesitant to or refuse to sign legal papers, for reasons | |
1530 mentioned below. Because of these reasons, XEmacs has never enforced legal | |
1531 signed papers for the code in it. Such papers are not a part of the GPL and | |
1532 are not required by any projects other than those of the FSF (for example, | |
1533 Linux does not require such papers). Since we do not know exactly who is | |
1534 the author of every bit of code that has been contributed to XEmacs in the | |
1535 last nine years, we would essentially have to rewrite large sections of the | |
1536 code. The situation, however, is worse than that because many of the large | |
1537 copyright holders of XEmacs (for example Sun Microsystems) refuse to sign | |
1538 legal papers. Although they have not stated their reasons, there are quite | |
1539 a number of reasons not to sign legal papers: | |
1540 | |
1541 @itemize @bullet | |
1542 @item | |
1543 By doing so you essentially give up all control over your code. You can | |
1544 no longer release your code under a different license. If you want to | |
1545 use your code that you've contributed to the FSF in a project of your | |
1546 own, and that project is not released under the GPL, you are not allowed | |
1547 to do this. Obviously, large companies tend to want to reuse their code | |
1548 in many different projects and as a result feel very uncomfortable about | |
1549 signing legal papers. | |
1550 @item | |
1551 One of the dangers of assigning copyright to the FSF is that if the FSF | |
1552 happens to be taken over by some evil corporate entity or anyone with | |
1553 different ideas than RMS, they will own all copyright-assigned code, and | |
1554 can revoke the GPL and enforce any license they please. If the code has | |
1555 many different copyright holders, this scenario is much less | |
1556 likely. | |
1557 @end itemize | |
1558 | |
1559 @item | |
1560 RMS does not like abstract data structures. Abstract data structures are | |
1561 the foundation of XEmacs and most other modern programming projects. In | |
1562 my opinion, it is difficult to impossible to write maintainable and | |
1563 expandable code without using abstract data structures. In merging talks | |
1564 with RMS he has said we can have any abstract data structures we want in | |
1565 a merged version but must allow direct access to the implementation as | |
1566 well, which defeats the primary purpose of having abstract data | |
1567 structures. | |
1568 | |
1569 @item | |
1570 RMS is very unwilling to compromise when it comes to divergent | |
1571 implementations of the same functionality, which is very common between | |
1572 XEmacs and GNU Emacs. Rather than taking the better interface on | |
1573 technical grounds, RMS insists that both interfaces must be implemented | |
1574 in C at the same level (rather than implementing one in C and the other | |
1575 on top of it), so that code that uses either interface is just as | |
1576 fast. This means that the resulting merged Emacs would be filled with a | |
1577 lot of very complicated code to simultaneously support two divergent | |
1578 interfaces, and would be difficult to maintain in this state. | |
1579 | |
1580 @item | |
1581 RMS's idea of compromise and cooperation is almost purely political | |
1582 rather than technical. The XEmacs maintainers would like to have issues | |
1583 resolved by examining them technically and deciding what makes the most | |
1584 sense from a technical perspective. RMS, however, wants to proceed on a | |
1585 tit-for-tat basis, which is to say, ``If we support this feature | |
1586 of yours, we also get to support this other feature of mine.'' The | |
1587 result of such a process is typically a big mess, because there is no | |
1588 overarching design but instead a great deal of incompatible things | |
1589 hodgepodged together. | |
1590 @end enumerate | |
1591 | |
1592 If only some of the above differences were firmly held by RMS, and if he | |
1593 were willing to compromise effectively on the others and to demonstrate | |
1594 willingness to work with us on the issues that he is less willing to | |
1595 compromise on, we might go ahead with the merge despite misgivings. However, | |
1596 RMS has shown no real interest at all in compromising. He has never stated | |
1597 how all of the redundant work that would be required to support his | |
1598 preconditions would get done. It's unlikely that he would do it all and | |
1599 it's certainly not clear that the XEmacs project would be willing to do it | |
1600 all, given that it is a tremendous amount of extra work and the XEmacs | |
1601 project is already strapped for coding resources. (Not to mention the | |
1602 inherent difficulty in convincing people to redo existing work for | |
1603 primarily political reasons.) In general the free software community is | |
1604 quite strapped as a whole for coding resources; duplicative efforts accomplish | |
1605 very little and have a lot of negative effects in that they | |
1606 take away what few resources we do have from projects that would actually | |
1607 be useful. | |
1608 | |
1609 RMS, however, does not seem to be bothered by this. He is more interested in | |
1610 sticking firm to his principles, though the heavens may fall down, than in | |
1611 working forward to create genuinely useful software. It is abundantly clear | |
1612 that RMS has no real interest in unity except if it happens to be on his | |
1613 own terms and allows him ultimate control over the result. He would rather | |
1614 see nothing happen at all than something that is not exactly according to | |
1615 his principles. The fact that few if any people share his principles is | |
1616 meaningless to him. | |
1617 | |
1618 @node XEmacs from the Outside, The Lisp Language, The XEmacs Split, Top | |
1619 @chapter XEmacs from the Outside | |
1349 @cindex XEmacs from the outside | 1620 @cindex XEmacs from the outside |
1350 @cindex outside, XEmacs from the | 1621 @cindex outside, XEmacs from the |
1351 @cindex read-eval-print | 1622 @cindex read-eval-print |
1352 | 1623 |
1353 XEmacs appears to the outside world as an editor, but it is really a | 1624 XEmacs appears to the outside world as an editor, but it is really a |
1386 @cindex pi, calculating | 1657 @cindex pi, calculating |
1387 Note that you do not have to use XEmacs as an editor; you could just | 1658 Note that you do not have to use XEmacs as an editor; you could just |
1388 as well make it do your taxes, compute pi, play bridge, etc. You'd just | 1659 as well make it do your taxes, compute pi, play bridge, etc. You'd just |
1389 have to write functions to do those operations in Lisp. | 1660 have to write functions to do those operations in Lisp. |
1390 | 1661 |
1391 @node The Lisp Language, XEmacs From the Perspective of Building, XEmacs From the Outside, Top | 1662 @node The Lisp Language, XEmacs from the Perspective of Building, XEmacs from the Outside, Top |
1392 @chapter The Lisp Language | 1663 @chapter The Lisp Language |
1393 @cindex Lisp language, the | 1664 @cindex Lisp language, the |
1394 @cindex Lisp vs. C | 1665 @cindex Lisp vs. C |
1395 @cindex C vs. Lisp | 1666 @cindex C vs. Lisp |
1396 @cindex Lisp vs. Java | 1667 @cindex Lisp vs. Java |
1608 The word @dfn{application} in the previous paragraph was used | 1879 The word @dfn{application} in the previous paragraph was used |
1609 intentionally. XEmacs implements an API for programs written in Lisp | 1880 intentionally. XEmacs implements an API for programs written in Lisp |
1610 that makes it a full-fledged application platform, very much like an OS | 1881 that makes it a full-fledged application platform, very much like an OS |
1611 inside the real OS. | 1882 inside the real OS. |
1612 | 1883 |
1613 @node XEmacs From the Perspective of Building, Build-Time Dependencies, The Lisp Language, Top | 1884 @node XEmacs from the Perspective of Building, Build-Time Dependencies, The Lisp Language, Top |
1614 @chapter XEmacs From the Perspective of Building | 1885 @chapter XEmacs from the Perspective of Building |
1615 @cindex XEmacs from the perspective of building | 1886 @cindex XEmacs from the perspective of building |
1616 @cindex building, XEmacs from the perspective of | 1887 @cindex building, XEmacs from the perspective of |
1617 | 1888 |
1618 The heart of XEmacs is the Lisp environment, which is written in C. | 1889 The heart of XEmacs is the Lisp environment, which is written in C. |
1619 This is contained in the @file{src/} subdirectory. Underneath | 1890 This is contained in the @file{src/} subdirectory. Underneath |
1719 This is useful when the dumping procedure described above is broken, or | 1990 This is useful when the dumping procedure described above is broken, or |
1720 when using certain program debugging tools such as Purify. These tools | 1991 when using certain program debugging tools such as Purify. These tools |
1721 get mighty confused by the tricks played by the XEmacs build process, | 1992 get mighty confused by the tricks played by the XEmacs build process, |
1722 such as allocating memory in one process, and freeing it in the next. | 1993 such as allocating memory in one process, and freeing it in the next. |
1723 | 1994 |
1724 @node Build-Time Dependencies, XEmacs From the Inside, XEmacs From the Perspective of Building, Top | 1995 @node Build-Time Dependencies, The Modules of XEmacs, XEmacs from the Perspective of Building, Top |
1725 @chapter Build-Time Dependencies | 1996 @chapter Build-Time Dependencies |
1726 @cindex build-time dependencies | 1997 @cindex build-time dependencies |
1727 @cindex dependencies, build-time | 1998 @cindex dependencies, build-time |
1728 | 1999 |
1729 This is a collection of random notes on build-time dependencies as of | 2000 This is a collection of random notes on build-time dependencies as of |
1783 use any higher-level functionality that might load @file{custom.el}, but | 2054 use any higher-level functionality that might load @file{custom.el}, but |
1784 you do not need @file{subr.el}, you should @samp{defvar} | 2055 you do not need @file{subr.el}, you should @samp{defvar} |
1785 @code{custom-declare-variable-list} to prevent the @samp{void-variable} | 2056 @code{custom-declare-variable-list} to prevent the @samp{void-variable} |
1786 error. (Currently this is only needed for @file{make-docfile.el}.) | 2057 error. (Currently this is only needed for @file{make-docfile.el}.) |
1787 | 2058 |
1788 @node XEmacs From the Inside, The XEmacs Object System (Abstractly Speaking), Build-Time Dependencies, Top | 2059 @node The Modules of XEmacs, Major Textual Changes, Build-Time Dependencies, Top |
1789 @chapter XEmacs From the Inside | |
1790 @cindex XEmacs from the inside | |
1791 @cindex inside, XEmacs from the | |
1792 | |
1793 Internally, XEmacs is quite complex, and can be very confusing. To | |
1794 simplify things, it can be useful to think of XEmacs as containing an | |
1795 event loop that ``drives'' everything, and a number of other subsystems, | |
1796 such as a Lisp engine and a redisplay mechanism. Each of these other | |
1797 subsystems exists simultaneously in XEmacs, and each has a certain | |
1798 state. The flow of control continually passes in and out of these | |
1799 different subsystems in the course of normal operation of the editor. | |
1800 | |
1801 It is important to keep in mind that, most of the time, the editor is | |
1802 ``driven'' by the event loop. Except during initialization and batch | |
1803 mode, all subsystems are entered directly or indirectly through the | |
1804 event loop, and ultimately, control exits out of all subsystems back up | |
1805 to the event loop. This cycle of entering a subsystem, exiting back out | |
1806 to the event loop, and starting another iteration of the event loop | |
1807 occurs once each keystroke, mouse motion, etc. | |
1808 | |
1809 If you're trying to understand a particular subsystem (other than the | |
1810 event loop), think of it as a ``daemon'' process or ``servant'' that is | |
1811 responsible for one particular aspect of a larger system, and | |
1812 periodically receives commands or environment changes that cause it to | |
1813 do something. Ultimately, these commands and environment changes are | |
1814 always triggered by the event loop. For example: | |
1815 | |
1816 @itemize @bullet | |
1817 @item | |
1818 The window and frame mechanism is responsible for keeping track of what | |
1819 windows and frames exist, what buffers are in them, etc. It is | |
1820 periodically given commands (usually from the user) to make a change to | |
1821 the current window/frame state: i.e. create a new frame, delete a | |
1822 window, etc. | |
1823 | |
1824 @item | |
1825 The buffer mechanism is responsible for keeping track of what buffers | |
1826 exist and what text is in them. It is periodically given commands | |
1827 (usually from the user) to insert or delete text, create a buffer, etc. | |
1828 When it receives a text-change command, it notifies the redisplay | |
1829 mechanism. | |
1830 | |
1831 @item | |
1832 The redisplay mechanism is responsible for making sure that windows and | |
1833 frames are displayed correctly. It is periodically told (by the event | |
1834 loop) to actually ``do its job'', i.e. snoop around and see what the | |
1835 current state of the environment (mostly of the currently-existing | |
1836 windows, frames, and buffers) is, and make sure that state matches | |
1837 what's actually displayed. It keeps lots and lots of information around | |
1838 (such as what is actually being displayed currently, and what the | |
1839 environment was last time it checked) so that it can minimize the work | |
1840 it has to do. It is also helped along in that whenever a relevant | |
1841 change to the environment occurs, the redisplay mechanism is told about | |
1842 this, so it has a pretty good idea of where it has to look to find | |
1843 possible changes and doesn't have to look everywhere. | |
1844 | |
1845 @item | |
1846 The Lisp engine is responsible for executing the Lisp code in which most | |
1847 user commands are written. It is entered through a call to @code{eval} | |
1848 or @code{funcall}, which occurs as a result of dispatching an event from | |
1849 the event loop. The functions it calls issue commands to the buffer | |
1850 mechanism, the window/frame subsystem, etc. | |
1851 | |
1852 @item | |
1853 The Lisp allocation subsystem is responsible for keeping track of Lisp | |
1854 objects. It is given commands from the Lisp engine to allocate objects, | |
1855 garbage collect, etc. | |
1856 @end itemize | |
1857 | |
1858 etc. | |
1859 | |
1860 The important idea here is that there are a number of independent | |
1861 subsystems each with its own responsibility and persistent state, just | |
1862 like different employees in a company, and each subsystem is | |
1863 periodically given commands from other subsystems. Commands can flow | |
1864 from any one subsystem to any other, but there is usually some sort of | |
1865 hierarchy, with all commands originating from the event subsystem. | |
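
The hierarchy described above can be pictured with a small C sketch. It is
only an illustration, not the actual XEmacs code: the event kinds, the
structure names, and the functions are all hypothetical. It shows an event
loop that issues commands to a buffer subsystem and a frame subsystem, each
holding its own persistent state, and then tells redisplay to bring what is
shown into line with the current environment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds; the real XEmacs event types differ. */
enum event_kind { EV_KEY, EV_NEW_FRAME, EV_QUIT };

/* Each subsystem keeps its own persistent state between commands. */
struct buffer_state    { int chars; int redisplay_notifications; };
struct frame_state     { int frames; };
struct redisplay_state { int shown_chars; int shown_frames; int passes; };

struct editor
{
  struct buffer_state buffer;
  struct frame_state frame;
  struct redisplay_state redisplay;
};

/* Buffer subsystem: a text change also notifies redisplay. */
static void
buffer_insert_char (struct editor *ed)
{
  ed->buffer.chars++;
  ed->buffer.redisplay_notifications++;
}

/* Window/frame subsystem: track one more frame. */
static void
frame_create (struct editor *ed)
{
  ed->frame.frames++;
}

/* Redisplay: make what is shown match the current environment. */
static void
redisplay_update (struct editor *ed)
{
  ed->redisplay.shown_chars = ed->buffer.chars;
  ed->redisplay.shown_frames = ed->frame.frames;
  ed->buffer.redisplay_notifications = 0;
  ed->redisplay.passes++;
}

/* The event loop: every command originates here, and control comes
   back here after each event.  Returns the number of events
   dispatched before EV_QUIT (or the end of the array).  */
static int
run_event_loop (struct editor *ed, const enum event_kind *events, size_t n)
{
  int dispatched = 0;
  assert (ed != NULL && events != NULL);
  for (size_t i = 0; i < n; i++)
    {
      if (events[i] == EV_QUIT)
        break;
      switch (events[i])
        {
        case EV_KEY:       buffer_insert_char (ed); break;
        case EV_NEW_FRAME: frame_create (ed);       break;
        default:           break;
        }
      redisplay_update (ed);   /* once per iteration of the loop */
      dispatched++;
    }
  return dispatched;
}
```

Feeding the loop two key events and one new-frame event leaves redisplay
showing two characters and one frame after three passes.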
1866 | |
1867 XEmacs is entered in @code{main()}, which is in @file{emacs.c}. When | |
1868 this is called the first time (in a properly-invoked @file{temacs}), it | |
1869 does the following: | |
1870 | |
1871 @enumerate | |
1872 @item | |
1873 It does some very basic environment initializations, such as determining | |
1874 where it and its directories (e.g. @file{lisp/} and @file{etc/}) reside | |
1875 and setting up signal handlers. | |
1876 @item | |
1877 It initializes the entire Lisp interpreter. | |
1878 @item | |
1879 It sets the initial values of many built-in variables (including many | |
1880 variables that are visible to Lisp programs), such as the global keymap | |
1881 object and the built-in faces (a face is an object that describes the | |
1882 display characteristics of text). This involves creating Lisp objects | |
1883 and thus is dependent on step (2). | |
1884 @item | |
1885 It performs various other initializations that are relevant to the | |
1886 particular environment it is running in, such as retrieving environment | |
1887 variables, determining the current date and the user who is running the | |
1888 program, examining its standard input, creating any necessary file | |
1889 descriptors, etc. | |
1890 @item | |
1891 At this point, the C initialization is complete. A Lisp program that | |
1892 was specified on the command line (usually @file{loadup.el}) is called | |
1893 (temacs is normally invoked as @code{temacs -batch -l loadup.el dump}). | |
1894 @file{loadup.el} loads all of the other Lisp files that are needed for | |
1895 the operation of the editor, calls the @code{dump-emacs} function to | |
1896 write out @file{xemacs}, and then kills the temacs process. | |
1897 @end enumerate | |
1898 | |
1899 When @file{xemacs} is then run, it only redoes steps (1) and (4) | |
1900 above; all variables already contain the values they were set to when | |
1901 the executable was dumped, and all memory that was allocated with | |
1902 @code{malloc()} is still around. (XEmacs knows whether it is being run | |
1903 as @file{xemacs} or @file{temacs} because it sets the global variable | |
1904 @code{initialized} to 1 after step (4) above.) At this point, | |
1905 @file{xemacs} calls a Lisp function to do any further initialization, | |
1906 which includes parsing the command-line (the C code can only do limited | |
1907 command-line parsing, which includes looking for the @samp{-batch} and | |
1908 @samp{-l} flags and a few other flags that it needs to know about before | |
1909 initialization is complete), creating the first frame (or @dfn{window} | |
1910 in standard window-system parlance), running the user's init file | |
1911 (usually the file @file{.emacs} in the user's home directory), etc. The | |
1912 function to do this is usually called @code{normal-top-level}; | |
1913 @file{loadup.el} tells the C code about this function by setting its | |
1914 name as the value of the Lisp variable @code{top-level}. | |
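
The difference between the first run (as temacs) and later runs (as the
dumped xemacs) can be sketched in C as follows. This is a simplified model
under stated assumptions, not the code in emacs.c: the step flags and the
function name are hypothetical, and only the variable name `initialized`
comes from the text above. Calling the function a second time in the same
process stands in for running the dumped executable, in which the saved
value of `initialized` is already 1.

```c
#include <assert.h>

/* Stands in for the global variable `initialized' described above. */
static int initialized = 0;

/* Hypothetical bit flags recording which initialization steps ran. */
enum
{
  STEP_BASIC_ENV    = 1,  /* step (1): directories, signal handlers */
  STEP_LISP_INTERP  = 2,  /* step (2): the Lisp interpreter */
  STEP_BUILTIN_VARS = 4,  /* step (3): built-in variables and objects */
  STEP_RUNTIME_ENV  = 8   /* step (4): environment-specific setup */
};

/* Run the C-level initialization and report which steps were done. */
static unsigned
run_c_initialization (void)
{
  unsigned steps = 0;

  steps |= STEP_BASIC_ENV;          /* always redone */
  if (!initialized)
    {
      steps |= STEP_LISP_INTERP;    /* only before dumping */
      steps |= STEP_BUILTIN_VARS;   /* creates Lisp objects */
    }
  steps |= STEP_RUNTIME_ENV;        /* always redone */

  /* Step (3) creates Lisp objects, so it must never run without (2). */
  assert (!(steps & STEP_BUILTIN_VARS) || (steps & STEP_LISP_INTERP));

  initialized = 1;                  /* set after step (4) */
  return steps;
}
```

The first call performs all four steps; any later call performs only steps
(1) and (4).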
1915 | |
1916 When the Lisp initialization code is done, the C code enters the event | |
1917 loop, and stays there for the duration of the XEmacs process. The code | |
1918 for the event loop is contained in @file{cmdloop.c}, and is called | |
1919 @code{Fcommand_loop_1()}. Note that this event loop could very well be | |
1920 written in Lisp, and in fact a Lisp version exists; but apparently, | |
1921 doing this makes XEmacs run noticeably slower. | |
1922 | |
1923 Notice how much of the initialization is done in Lisp, not in C. | |
1924 In general, XEmacs tries to move as much code as is possible | |
1925 into Lisp. Code that remains in C is code that implements the | |
1926 Lisp interpreter itself, or code that needs to be very fast, or | |
1927 code that needs to do system calls or other such stuff that | |
1928 needs to be done in C, or code that needs to have access to | |
1929 ``forbidden'' structures. (One conscious aspect of the design of | |
1930 Lisp under XEmacs is a clean separation between the external | |
1931 interface to a Lisp object's functionality and its internal | |
1932 implementation. Part of this design is that Lisp programs | |
1933 are forbidden from accessing the contents of the object other | |
1934 than through using a standard API. In this respect, XEmacs Lisp | |
1935 is similar to modern Lisp dialects but differs from GNU Emacs, | |
1936 which tends to expose the implementation and allow Lisp | |
1937 programs to look at it directly. The major advantage of | |
1938 hiding the implementation is that it allows the implementation | |
1939 to be redesigned without affecting any Lisp programs, including | |
1940 those that might want to be ``clever'' by looking directly at | |
1941 the object's contents and possibly manipulating them.) | |
1942 | |
1943 Moving code into Lisp makes the code easier to debug and maintain and | |
1944 makes it much easier for people who are not XEmacs developers to | |
1945 customize XEmacs, because they can make a change with much less chance | |
1946 of obscure and unwanted interactions occurring than if they were to | |
1947 change the C code. | |
1948 | |
1949 @node The XEmacs Object System (Abstractly Speaking), How Lisp Objects Are Represented in C, XEmacs From the Inside, Top | |
1950 @chapter The XEmacs Object System (Abstractly Speaking) | |
1951 @cindex XEmacs object system (abstractly speaking), the | |
1952 @cindex object system (abstractly speaking), the XEmacs | |
1953 | |
1954 At the heart of the Lisp interpreter is its management of objects. | |
1955 XEmacs Lisp contains many built-in objects, some of which are | |
1956 simple and others of which can be very complex; and some of which | |
1957 are very common, and others of which are rarely used or are only | |
1958 used internally. (Since the Lisp allocation system, with its | |
1959 automatic reclamation of unused storage, is so much more convenient | |
1960 than @code{malloc()} and @code{free()}, the C code makes extensive use of it | |
1961 in its internal operations.) | |
1962 | |
1963 The basic Lisp objects are | |
1964 | |
1965 @table @code | |
1966 @item integer | |
1967 31 bits of precision, or 63 bits on 64-bit machines; the | |
1968 reason for this is described below when the internal Lisp object | |
1969 representation is described. | |
1970 @item char | |
1971 An object representing a single character of text; chars behave like | |
1972 integers in many ways but are logically considered text rather than | |
1973 numbers and have a different read syntax. (The read syntax for a char | |
1974 contains the char itself or some textual encoding of it---for example, | |
1975 a Japanese Kanji character might be encoded as @samp{^[$(B#&^[(B} using the | |
1976 ISO-2022 encoding standard---rather than the numerical representation | |
1977 of the char; this way, if the mapping between chars and integers | |
1978 changes, which is quite possible for Kanji characters and other extended | |
1979 characters, the same character will still be created. Note that some | |
1980 primitives confuse chars and integers. The worst culprit is @code{eq}, | |
1981 which makes a special exception and considers a char to be @code{eq} to | |
1982 its integer equivalent, even though in no other case are objects of two | |
1983 different types @code{eq}. The reason for this monstrosity is | |
1984 compatibility with existing code; the separation of char from integer | |
1985 came fairly recently.) | |
1986 @item float | |
1987 Same precision as a double in C. | |
1988 @item bignum | |
1989 @itemx ratio | |
1990 @itemx bigfloat | |
1991 As build-time options, arbitrary-precision numbers are available. | |
1992 Bignums are integers, and when available they remove the restriction on | |
1993 buffer size. Ratios are non-integral rational numbers. Bigfloats are | |
1994 arbitrary-precision floating point numbers, with precision specified at | |
1995 runtime. | |
1996 @item symbol | |
1997 An object that contains Lisp objects and is referred to by name; | |
1998 symbols are used to implement variables and named functions | |
1999 and to provide the equivalent of preprocessor constants in C. | |
2000 @item string | |
2001 Self-explanatory; behaves much like a vector of chars | |
2002 but has a different read syntax and is stored and manipulated | |
2003 more compactly. | |
2004 @item bit-vector | |
2005 A vector of bits; similar to a string in spirit. | |
2006 @item vector | |
2007 A one-dimensional array of Lisp objects providing constant-time access | |
2008 to any of the objects; access to an arbitrary object in a vector is | |
2009 faster than for lists, but the operations that can be done on a vector | |
2010 are more limited. | |
2011 @item compiled-function | |
2012 An object containing compiled Lisp code, known as @dfn{byte code}. | |
2013 @item subr | |
2014 A Lisp primitive, i.e. a Lisp-callable function implemented in C. | |
2015 @item cons | |
2016 A simple container for two Lisp objects, used to implement lists and | |
2017 most other data structures in Lisp. | |
2018 @end table | |
2019 | |
2020 Objects which are not conses are called atoms. | |
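
One way to see where an integer precision of one bit less than the machine
word can come from is a tagged-word scheme. The C sketch below is an
assumption made for illustration, not the actual XEmacs representation: it
reserves the lowest bit of a word to mark integers, which leaves 31 value
bits on a 32-bit machine and 63 on a 64-bit machine. The type and function
names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical: one machine word per Lisp object. */
typedef uintptr_t lisp_word;

/* Assumption of this sketch: the low bit set means "integer". */
#define INT_TAG_BIT ((lisp_word) 1)

/* Bits left for the integer value once the tag bit is taken. */
#define VALUE_BITS (sizeof (lisp_word) * CHAR_BIT - 1)

static int
word_is_int (lisp_word w)
{
  return (w & INT_TAG_BIT) != 0;
}

/* Pack N into a tagged word.  The shift is done on the unsigned
   type so that negative values do not invoke undefined behavior.  */
static lisp_word
make_int (intptr_t n)
{
  return ((lisp_word) n << 1) | INT_TAG_BIT;
}

/* Unpack a tagged integer.  This relies on right shift of a negative
   signed value being arithmetic, which holds on mainstream compilers
   but is implementation-defined in ISO C.  */
static intptr_t
int_value (lisp_word w)
{
  assert (word_is_int (w));
  return (intptr_t) w >> 1;
}
```

Words whose low bit is clear, such as suitably aligned pointers, remain
available for every other kind of object.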
2021 | |
2022 @cindex closure | |
2023 Note that there is no basic ``function'' type, as in more powerful | |
2024 versions of Lisp (where it's called a @dfn{closure}). XEmacs Lisp does | |
2025 not provide the closure semantics implemented by Common Lisp and Scheme. | |
2026 The guts of a function in XEmacs Lisp are represented in one of four | |
2027 ways: a symbol specifying another function (when one function is an | |
2028 alias for another), a list (whose first element must be the symbol | |
2029 @code{lambda}) containing the function's source code, a | |
2030 compiled-function object, or a subr object. (In other words, given a | |
2031 symbol specifying the name of a function, calling @code{symbol-function} | |
2032 to retrieve the contents of the symbol's function cell will return one | |
2033 of these types of objects.) | |
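
The four representations can be modelled in C as a tagged structure. The
sketch below is hypothetical throughout (the enum, the structure, and the
function are not taken from the XEmacs sources); it shows how a caller
might follow a chain of aliasing symbols until it reaches one of the three
representations that can actually be executed.

```c
#include <assert.h>
#include <stddef.h>

/* The four ways the guts of a function can be represented. */
enum fn_kind { FN_SYMBOL, FN_LAMBDA_LIST, FN_COMPILED, FN_SUBR };

struct fn_object
{
  enum fn_kind kind;
  /* FN_SYMBOL only: contents of the named symbol's function cell,
     or NULL if that symbol has no function definition.  */
  const struct fn_object *alias_target;
};

/* Follow aliases until a lambda list, compiled function, or subr is
   found.  Returns NULL for an unbound alias or when more than
   MAX_HOPS aliases are chained (which also catches cycles).  */
static const struct fn_object *
resolve_function (const struct fn_object *fn, int max_hops)
{
  assert (fn != NULL);
  while (fn->kind == FN_SYMBOL)
    {
      if (max_hops-- <= 0 || fn->alias_target == NULL)
        return NULL;
      fn = fn->alias_target;
    }
  return fn;
}
```

The hop limit is a design choice of this sketch: it keeps a symbol that is
aliased to itself, directly or indirectly, from looping forever.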
2034 | |
2035 XEmacs Lisp also contains numerous specialized objects used to implement | |
2036 the editor: | |
2037 | |
2038 @table @code | |
2039 @item buffer | |
2040 Stores text like a string, but is optimized for insertion and deletion | |
2041 and has certain other properties that can be set. | |
2042 @item frame | |
2043 An object with various properties whose displayable representation is a | |
2044 @dfn{window} in window-system parlance. | |
2045 @item window | |
2046 A section of a frame that displays the contents of a buffer; | |
2047 often called a @dfn{pane} in window-system parlance. | |
2048 @item window-configuration | |
2049 An object that represents a saved configuration of windows in a frame. | |
2050 @item device | |
2051 An object representing a screen on which frames can be displayed; | |
2052 equivalent to a @dfn{display} in the X Window System and a @dfn{TTY} in | |
2053 character mode. | |
2054 @item face | |
2055 An object specifying the appearance of text or graphics; it has | |
2056 properties such as font, foreground color, and background color. | |
2057 @item marker | |
2058 An object that refers to a particular position in a buffer and moves | |
2059 around as text is inserted and deleted to stay in the same relative | |
2060 position to the text around it. | |
2061 @item extent | |
2062 Similar to a marker but covers a range of text in a buffer; can also | |
2063 specify properties of the text, such as a face in which the text is to | |
2064 be displayed, whether the text is invisible or unmodifiable, etc. | |
2065 @item event | |
2066 Generated by calling @code{next-event} and contains information | |
2067 describing a particular event happening in the system, such as the user | |
2068 pressing a key or a process terminating. | |
2069 @item keymap | |
2070 An object that maps from events (described using lists, vectors, and | |
2071 symbols rather than with an event object because the mapping is for | |
2072 classes of events, rather than individual events) to functions to | |
2073 execute or other events to recursively look up; the functions are | |
2074 described by name, using a symbol, or using lists to specify the | |
2075 function's code. | |
2076 @item glyph | |
2077 An object that describes the appearance of an image (e.g. pixmap) on | |
2078 the screen; glyphs can be attached to the beginning or end of extents | |
2079 and in some future version of XEmacs will be able to be inserted | |
2080 directly into a buffer. | |
2081 @item process | |
2082 An object that describes a connection to an externally-running process. | |
2083 @end table | |
2084 | |
2085 There are some other, less-commonly-encountered general objects: | |
2086 | |
2087 @table @code | |
2088 @item hash-table | |
2089 An object that maps from an arbitrary Lisp object to another arbitrary | |
2090 Lisp object, using hashing for fast lookup. | |
2091 @item obarray | |
2092 A limited form of hash-table that maps from strings to symbols; obarrays | |
2093 are used to look up a symbol given its name and are not actually their | |
2094 own object type but are kludgily represented using vectors with hidden | |
2095 fields (this representation derives from GNU Emacs). | |
2096 @item specifier | |
2097 A complex object used to specify the value of a display property; a | |
2098 default value is given and different values can be specified for | |
2099 particular frames, buffers, windows, devices, or classes of device. | |
2100 @item char-table | |
2101 An object that maps from chars or classes of chars to arbitrary Lisp | |
2102 objects; internally char tables use a complex nested-vector | |
2103 representation that is optimized to the way characters are represented | |
2104 as integers. | |
2105 @item range-table | |
2106 An object that maps from ranges of integers to arbitrary Lisp objects. | |
2107 @end table | |
2108 | |
2109 And some strange special-purpose objects: | |
2110 | |
2111 @table @code | |
2112 @item charset | |
2113 @itemx coding-system | |
2114 Objects used when MULE, or multi-lingual/Asian-language, support is | |
2115 enabled. | |
2116 @item color-instance | |
2117 @itemx font-instance | |
2118 @itemx image-instance | |
2119 Objects that encapsulate a window-system resource; instances are | |
2120 mostly used internally but are exposed on the Lisp level for cleanness | |
2121 of the specifier model and because it's occasionally useful for Lisp | |
2122 programs to create or query the properties of instances. | |
2123 @item subwindow | |
2124 An object that encapsulates a @dfn{subwindow} resource, i.e. a | |
2125 window-system child window that is drawn into by an external process; | |
2126 this object should be integrated into the glyph system but isn't yet, | |
2127 and may change form when this is done. | |
2128 @item tooltalk-message | |
2129 @itemx tooltalk-pattern | |
2130 Objects that represent resources used in the ToolTalk interprocess | |
2131 communication protocol. | |
2132 @item toolbar-button | |
2133 An object used in conjunction with the toolbar. | |
2134 @end table | |
2135 | |
2136 And objects that are only used internally: | |
2137 | |
2138 @table @code | |
2139 @item opaque | |
2140 A generic object for encapsulating arbitrary memory; this allows you the | |
2141 generality of @code{malloc()} and the convenience of the Lisp object | |
2142 system. | |
2143 @item lstream | |
2144 A buffering I/O stream, used to provide a unified interface to anything | |
2145 that can accept output or provide input, such as a file descriptor, a | |
2146 stdio stream, a chunk of memory, a Lisp buffer, a Lisp string, etc.; | |
2147 it's a Lisp object to make its memory management more convenient. | |
2148 @item char-table-entry | |
2149 Subsidiary objects in the internal char-table representation. | |
2150 @item extent-auxiliary | |
2151 @itemx menubar-data | |
2152 @itemx toolbar-data | |
2153 Various special-purpose objects that are basically just used to | |
2154 encapsulate memory for particular subsystems, similar to the more | |
2155 general ``opaque'' object. | |
2156 @item symbol-value-forward | |
2157 @itemx symbol-value-buffer-local | |
2158 @itemx symbol-value-varalias | |
2159 @itemx symbol-value-lisp-magic | |
2160 Special internal-only objects that are placed in the value cell of a | |
2161 symbol to indicate that there is something special with this variable -- | |
2162 e.g. it has no value, it mirrors another variable, or it mirrors some C | |
2163 variable; there is really only one kind of object, called a | |
2164 @dfn{symbol-value-magic}, but it is sort-of halfway kludged into | |
2165 semi-different object types. | |
2166 @end table | |
2167 | |
2168 @cindex permanent objects | |
2169 @cindex temporary objects | |
2170 Some types of objects are @dfn{permanent}, meaning that once created, | |
2171 they do not disappear until explicitly destroyed, using a function such | |
2172 as @code{delete-buffer}, @code{delete-window}, @code{delete-frame}, etc. | |
2173 Others will disappear once they are no longer used, through the garbage | |
2174 collection mechanism. Buffers, frames, windows, devices, and processes | |
2175 are among the objects that are permanent. Note that some objects can go | |
2176 both ways: Faces can be created either way; extents are normally | |
2177 permanent, but detached extents (extents not referring to any text, as | |
2178 happens to some extents when the text they are referring to is deleted) | |
2179 are temporary. Note that some permanent objects, such as faces and | |
2180 coding systems, cannot be deleted. Note also that windows are unique in | |
2181 that they can be @emph{undeleted} after having previously been | |
2182 deleted. (This happens as a result of restoring a window configuration.) | |
2183 | |
2184 @cindex read syntax | |
2185 Many types of objects have a @dfn{read syntax}, i.e. a way of | |
2186 specifying an object of that type in Lisp code. When you load a Lisp | |
2187 file, or type in code to be evaluated, what really happens is that the | |
2188 function @code{read} is called, which reads some text and creates an object | |
2189 based on the syntax of that text; then @code{eval} is called, which | |
2190 possibly does something special; then this loop repeats until there's | |
2191 no more text to read. (@code{eval} only actually does something special | |
2192 with symbols, which causes the symbol's value to be returned, | |
2193 similar to referencing a variable; and with conses [i.e. lists], | |
2194 which cause a function invocation. All other values are returned | |
2195 unchanged.) | |
2196 | |
2197 The read syntax | |
2198 | |
2199 @example | |
2200 17297 | |
2201 @end example | |
2202 | |
2203 converts to an integer whose value is 17297. | |
2204 | |
2205 @example | |
2206 355/113 | |
2207 @end example | |
2208 | |
2209 converts to a ratio commonly used to approximate @emph{pi} when ratios | |
2210 are configured, and otherwise to a symbol whose name is ``355/113'' (for | |
2211 backward compatibility). | |
2212 | |
2213 @example | |
2214 1.983e-4 | |
2215 @end example | |
2216 | |
2217 converts to a float whose value is 1.983e-4, or .0001983. | |
2218 | |
2219 @example | |
2220 ?b | |
2221 @end example | |
2222 | |
2223 converts to a char that represents the lowercase letter b. | |
2224 | |
2225 @example | |
2226 ?^[$(B#&^[(B | |
2227 @end example | |
2228 | |
2229 (where @samp{^[} actually is an @samp{ESC} character) converts to a | |
2230 particular Kanji character when using an ISO2022-based coding system for | |
2231 input. (To decode this goo: @samp{ESC} begins an escape sequence; | |
2232 @samp{ESC $ (} is a class of escape sequences meaning ``switch to a | |
2233 94x94 character set''; @samp{ESC $ ( B} means ``switch to Japanese | |
2234 Kanji''; @samp{#} and @samp{&} collectively index into a 94-by-94 array | |
2235 of characters [subtract 33 from the ASCII value of each character to get | |
2236 the corresponding index]; @samp{ESC (} is a class of escape sequences | |
2237 meaning ``switch to a 94 character set''; @samp{ESC ( B} means ``switch | |
2238 to US ASCII''. It is a coincidence that the letter @samp{B} is used to | |
2239 denote both Japanese Kanji and US ASCII. If the first @samp{B} were | |
2240 replaced with an @samp{A}, you'd be requesting a Chinese Hanzi character | |
2241 from the GB2312 character set.) | |
2242 | |
2243 @example | |
2244 "foobar" | |
2245 @end example | |
2246 | |
2247 converts to a string. | |
2248 | |
2249 @example | |
2250 foobar | |
2251 @end example | |
2252 | |
2253 converts to a symbol whose name is @code{"foobar"}. This is done by | |
2254 looking up the string equivalent in the global variable | |
2255 @code{obarray}, whose contents should be an obarray. If no symbol | |
2256 is found, a new symbol with the name @code{"foobar"} is automatically | |
2257 created and added to @code{obarray}; this process is called | |
2258 @dfn{interning} the symbol. | |
2259 @cindex interning | |
2260 | |
2261 @example | |
2262 (foo . bar) | |
2263 @end example | |
2264 | |
2265 converts to a cons cell containing the symbols @code{foo} and @code{bar}. | |
2266 | |
2267 @example | |
2268 (1 a 2.5) | |
2269 @end example | |
2270 | |
2271 converts to a three-element list containing the specified objects | |
2272 (note that a list is actually a set of nested conses; see the | |
2273 XEmacs Lisp Reference). | |
2274 | |
2275 @example | |
2276 [1 a 2.5] | |
2277 @end example | |
2278 | |
2279 converts to a three-element vector containing the specified objects. | |
2280 | |
2281 @example | |
2282 #[... ... ... ...] | |
2283 @end example | |
2284 | |
2285 converts to a compiled-function object (the actual contents are not | |
2286 shown since they are not relevant here; look at a file that ends with | |
2287 @file{.elc} for examples). | |
2288 | |
2289 @example | |
2290 #*01110110 | |
2291 @end example | |
2292 | |
2293 converts to a bit-vector. | |
2294 | |
2295 @example | |
2296 #s(hash-table ... ...) | |
2297 @end example | |
2298 | |
2299 converts to a hash table (the actual contents are not shown). | |
2300 | |
2301 @example | |
2302 #s(range-table ... ...) | |
2303 @end example | |
2304 | |
2305 converts to a range table (the actual contents are not shown). | |
2306 | |
2307 @example | |
2308 #s(char-table ... ...) | |
2309 @end example | |
2310 | |
2311 converts to a char table (the actual contents are not shown). | |
2312 | |
2313 Note that the @code{#s()} syntax is the general syntax for structures, | |
2314 which are not really implemented in XEmacs Lisp but should be. | |
2315 | |
2316 When an object is printed out (using @code{print} or a related | |
2317 function), the read syntax is used, so that the same object can be read | |
2318 in again. | |
2319 | |
2320 The other objects do not have read syntaxes, usually because it does not | |
2321 really make sense to create them in this fashion (e.g. processes, where | |
2322 it doesn't make sense to have a subprocess created as a side effect of | |
2323 reading some Lisp code), or because they can't be created at all | |
2324 (e.g. subrs). Permanent objects, as a rule, do not have a read syntax; | |
2325 nor do most complex objects, which contain too much state to be easily | |
2326 initialized through a read syntax. | |
2327 | |
2328 @node How Lisp Objects Are Represented in C, Major Textual Changes, The XEmacs Object System (Abstractly Speaking), Top | |
2329 @chapter How Lisp Objects Are Represented in C | |
2330 @cindex Lisp objects are represented in C, how | |
2331 @cindex objects are represented in C, how Lisp | |
2332 @cindex represented in C, how Lisp objects are | |
2333 | |
2334 Lisp objects are represented in C using a 32-bit or 64-bit machine word | |
2335 (depending on the processor; e.g. DEC Alphas use 64-bit Lisp objects and | |
2336 most other processors use 32-bit Lisp objects). The representation | |
2337 stuffs a pointer together with a tag, as follows: | |
2338 | |
2339 @example | |
2340 [ 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ] | |
2341 [ 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ] | |
2342 | |
2343 <---------------------------------------------------------> <-> | |
2344 a pointer to a structure, or an integer tag | |
2345 @end example | |
2346 | |
2347 A tag of 00 is used for all pointer object types, a tag of 10 is used | |
2348 for characters, and the other two tags 01 and 11 are joined together to | |
2349 form the integer object type. This representation gives us 31-bit | |
2350 integers and 30-bit characters, while pointers are represented directly | |
2351 without any bit masking or shifting. This representation, though, | |
2352 assumes that pointers to structs are always aligned to multiples of 4, | |
2353 so the lower 2 bits are always zero. | |
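
As a rough illustration of the tagging scheme just described, the following C sketch classifies and packs values in a machine word. It is not the actual XEmacs code: every name beginning with @code{demo_} is hypothetical, and the sketch assumes two's-complement integers and an arithmetic right shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for Lisp_Object: a single machine word. */
typedef intptr_t demo_object;

/* Low two bits: 00 = pointer, 10 = character, 01 or 11 = integer. */
int demo_is_pointer (demo_object o) { return (o & 3) == 0; }
int demo_is_char (demo_object o) { return (o & 3) == 2; }
int demo_is_int (demo_object o) { return (o & 1) == 1; }

/* Characters: the payload sits above the two tag bits. */
demo_object demo_make_char (int c)
{
  assert (c >= 0);
  return ((demo_object) c << 2) | 2;
}
int demo_char_value (demo_object o) { return (int) (o >> 2); }

/* Integers: the payload sits above a single tag bit that is always 1,
   so the two low bits come out as either 01 or 11. */
demo_object demo_make_int (intptr_t n) { return n * 2 + 1; }

/* Pointers: stored as-is, relying on 4-byte alignment for the 00 tag. */
demo_object demo_make_pointer (void *p)
{
  assert (((uintptr_t) p & 3) == 0);
  return (demo_object) p;
}
void *demo_pointer_value (demo_object o) { return (void *) o; }
```

Note how an even integer such as 4 ends up with the low bits 01 and an odd one such as 7 with 11, which is why both tags belong to the integer type.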
2354 | |
2355 Lisp objects use the typedef @code{Lisp_Object}, but the actual C type | |
2356 used for the Lisp object can vary. It can be either a simple type | |
2357 (@code{long} on the DEC Alpha, @code{int} on other machines) or a | |
2358 structure whose fields are bit fields that line up properly (actually, a | |
2359 union of structures is used). Generally the simple integral type is | |
2360 preferable because it ensures that the compiler will actually use a | |
2361 machine word to represent the object (some compilers will use more | |
2362 general and less efficient code for unions and structs even if they can | |
2363 fit in a machine word). The union type, however, has the advantage of | |
2364 stricter type checking. If you accidentally pass an integer where a Lisp | |
2365 object is desired, you get a compile error. The choice of which type | |
2366 to use is determined by the preprocessor constant @code{USE_UNION_TYPE} | |
2367 which is defined via the @code{--use-union-type} option to | |
2368 @code{configure}. | |
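
The type-checking benefit of the union representation can be sketched as follows. This is only an illustration under simplifying assumptions: the real union uses bit fields, whereas the hypothetical @code{demo_object_union} below wraps a single word, and all @code{demo_} names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Simple-type variant: any integer expression converts to it silently. */
typedef intptr_t demo_object_simple;

/* Union variant (simplified, hypothetical layout): a distinct type that
   plain integers do not convert to. */
typedef union { intptr_t word; } demo_object_union;

int demo_is_int_simple (demo_object_simple o) { return (o & 1) == 1; }
int demo_is_int_union (demo_object_union o) { return (o.word & 1) == 1; }

/* Explicit construction is the only way to get a union object. */
demo_object_union demo_wrap (intptr_t w)
{
  demo_object_union o;
  /* The wrapper should cost nothing in size. */
  assert (sizeof (demo_object_union) == sizeof (intptr_t));
  o.word = w;
  return o;
}

/* demo_is_int_simple (3) compiles, even if 3 was passed by mistake.
   demo_is_int_union (3) is rejected at compile time, because an int
   is not a demo_object_union. */
```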
2369 | |
2370 Various macros are used to convert between Lisp_Objects and the | |
2371 corresponding C type. Macros of the form @code{XINT()}, @code{XCHAR()}, | |
2372 @code{XSTRING()}, @code{XSYMBOL()}, do any required bit shifting and/or | |
2373 masking and cast the result to the appropriate type. @code{XINT()} needs to be | |
2374 a bit tricky so that negative numbers are properly sign-extended. Since | |
2375 integers are stored left-shifted, if the right-shift operator does an | |
2376 arithmetic shift (i.e. it leaves the most-significant bit as-is rather | |
2377 than shifting in a zero, so that it mimics a divide-by-two even for | |
2378 negative numbers) the shift to remove the tag bit is enough. This is | |
2379 the case on all the systems we support. | |
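
To make the sign-extension point concrete, here is a small sketch contrasting an arithmetic shift with a logical one. The @code{demo_} names are hypothetical and are not the real @code{XINT()} definition; right-shifting a negative signed value is implementation-defined in ISO C, and the sketch assumes it is arithmetic, as stated above.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t demo_object;

/* Store n shifted left by one with the integer tag bit set.  Multiplying
   avoids left-shifting a negative value, which ISO C leaves undefined. */
demo_object demo_make_int (intptr_t n)
{
  assert (n >= INTPTR_MIN / 2 && n <= INTPTR_MAX / 2);
  return n * 2 + 1;
}

/* Arithmetic shift: the sign bit is replicated, so negative numbers
   come back intact once the tag bit is shifted out. */
intptr_t demo_xint (demo_object o) { return o >> 1; }

/* Logical shift (forced by going through an unsigned type): a zero is
   shifted in at the top, so a negative number comes back as a huge
   positive one. */
intptr_t demo_xint_logical (demo_object o)
{
  return (intptr_t) ((uintptr_t) o >> 1);
}
```

For non-negative values the two extractors agree; they differ only when the stored integer is negative.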
2380 | |
2381 Note that when @code{ERROR_CHECK_TYPECHECK} is defined, the converter | |
2382 macros become more complicated---they check the tag bits and/or the | |
2383 type field in the first four bytes of a record type to ensure that the | |
2384 object is really of the correct type. This is great for catching places | |
2385 where an incorrect type is being dereferenced---this typically results | |
2386 in a pointer being dereferenced as the wrong type of structure, with | |
2387 unpredictable (and sometimes not easily traceable) results. | |
2388 | |
2389 There are similar @code{XSET@var{TYPE}()} macros that construct a Lisp | |
2390 object. These macros are of the form @code{XSET@var{TYPE} | |
2391 (@var{lvalue}, @var{result})}, i.e. they have to be a statement rather | |
2392 than just used in an expression. The reason for this is that standard C | |
2393 doesn't let you ``construct'' a structure (but GCC does). Granted, this | |
2394 sometimes isn't too convenient; for the case of integers, at least, you | |
2395 can use the function @code{make_int()}, which constructs and | |
2396 @emph{returns} an integer Lisp object. Note that the | |
2397 @code{XSET@var{TYPE}()} macros are also affected by | |
2398 @code{ERROR_CHECK_TYPECHECK} and make sure that the structure is of the | |
2399 right type in the case of record types, where the type is contained in | |
2400 the structure. | |
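
The difference between a statement-style setter and a constructor function can be sketched like this. The names @code{DEMO_XSETINT}, @code{demo_make_int} and @code{demo_xint} are hypothetical stand-ins, and the one-member union is a simplification of the real representation.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical union-style object. */
typedef union { intptr_t word; } demo_object;

/* Statement-style setter: it assigns into an lvalue, so it cannot be
   nested inside a larger expression. */
#define DEMO_XSETINT(lvalue, n) \
  do { (lvalue).word = (intptr_t) (n) * 2 + 1; } while (0)

intptr_t demo_xint (demo_object o) { return o.word >> 1; }

/* Function-style constructor: it returns the object, so it can be used
   anywhere an expression is allowed. */
demo_object demo_make_int (intptr_t n)
{
  demo_object o;
  assert (n >= INTPTR_MIN / 2 && n <= INTPTR_MAX / 2);
  DEMO_XSETINT (o, n);
  return o;
}

intptr_t demo_sum (intptr_t a, intptr_t b)
{
  demo_object x;
  DEMO_XSETINT (x, a);                                   /* own statement */
  return demo_xint (x) + demo_xint (demo_make_int (b));  /* nested call */
}
```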
2401 | |
2402 The C programmer is responsible for @strong{guaranteeing} that a | |
2403 Lisp_Object is the correct type before using the @code{X@var{TYPE}} | |
2404 macros. This is especially important in the case of lists. Use | |
2405 @code{XCAR} and @code{XCDR} if a Lisp_Object is certainly a cons cell, | |
2406 else use @code{Fcar()} and @code{Fcdr()}. Trust other C code, but not | |
2407 Lisp code. On the other hand, if XEmacs has an internal logic error, | |
2408 it's better to crash immediately, so sprinkle @code{assert()}s and | |
2409 ``unreachable'' @code{abort()}s liberally about the source code. Where | |
2410 performance is an issue, use @code{type_checking_assert}, | |
2411 @code{bufpos_checking_assert}, and @code{gc_checking_assert}, which do | |
2412 nothing unless the corresponding configure error checking flag was | |
2413 specified. | |
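
A checking assert that costs nothing in ordinary builds might be defined along the following lines. This is a sketch, not the real definition: @code{DEMO_ERROR_CHECK_TYPES} and the @code{demo_} functions are hypothetical, standing in for whatever symbol the corresponding configure flag defines.

```c
#include <assert.h>

/* Hypothetical flag standing in for a configure-time error-checking
   option.  When it is not defined, the check expands to nothing and its
   argument is never evaluated. */
#ifdef DEMO_ERROR_CHECK_TYPES
# define demo_type_checking_assert(cond) assert (cond)
#else
# define demo_type_checking_assert(cond) ((void) 0)
#endif

int demo_evaluations = 0;

/* Pretend this is too slow to run on every call in a production build. */
int demo_expensive_check (void)
{
  demo_evaluations++;
  return 1;
}

int demo_hot_path (int x)
{
  demo_type_checking_assert (demo_expensive_check ());
  return x + 1;
}
```

Because the argument disappears entirely when the flag is off, it must not have side effects that the surrounding code depends on.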
2414 | |
2415 @node Major Textual Changes, Rules When Writing New C Code, How Lisp Objects Are Represented in C, Top | |
2416 @chapter Major Textual Changes | |
2417 @cindex textual changes, major | |
2418 @cindex major textual changes | |
2419 | |
2420 Sometimes major textual changes are made to the source. This means that | |
2421 a search-and-replace is done to change type names and such. Some people | |
2422 disagree with such changes, and certainly, if done without good reason, they | |
2423 will just lead to headaches. But it's important to keep the code clean | |
2424 and understandable, and consistent naming goes a long way towards this. | |
2425 | |
2426 An example of the right way to do this was the so-called "great integral | |
2427 type renaming". | |
2428 | |
2429 @menu | |
2430 * Great Integral Type Renaming:: | |
2431 * Text/Char Type Renaming:: | |
2432 @end menu | |
2433 | |
2434 @node Great Integral Type Renaming, Text/Char Type Renaming, Major Textual Changes, Major Textual Changes | |
2435 @section Great Integral Type Renaming | |
2436 @cindex Great Integral Type Renaming | |
2437 @cindex integral type renaming, great | |
2438 @cindex type renaming, integral | |
2439 @cindex renaming, integral types | |
2440 | |
2441 The purpose of this is to rationalize the names used for various | |
2442 integral types, so that they match their intended uses and follow | |
2443 consistent conventions, and eliminate types that were not semantically | |
2444 different from each other. | |
2445 | |
2446 The conventions are: | |
2447 | |
2448 @itemize @bullet | |
2449 @item | |
2450 All integral types that measure quantities of anything are signed. Some | |
2451 people disagree vociferously with this, but their arguments are mostly | |
2452 theoretical, and are vastly outweighed by the practical headaches of | |
2453 mixing signed and unsigned values, and more importantly by the greatly | |
2454 increased likelihood of inadvertent bugs: Because of the broken "viral" | |
2455 nature of unsigned quantities in C (operations involving mixed | |
2456 signed/unsigned are done unsigned, when exactly the opposite is nearly | |
2457 always wanted), even a single error in declaring a quantity unsigned | |
2458 that should be signed, or the even more subtle error of comparing | |
2459 signed and unsigned values and forgetting the necessary cast, can be | |
2460 catastrophic, as comparisons will yield wrong results. -Wsign-compare | |
2461 is turned on specifically to catch this, but this tends to result in a | |
2462 great number of warnings when mixing signed and unsigned, and the casts | |
2463 are annoying. More has been written on this elsewhere. | |
2464 | |
2465 @item | |
2466 All such quantity types just mentioned boil down to EMACS_INT, which is | |
2467 32 bits on 32-bit machines and 64 bits on 64-bit machines. This is | |
2468 guaranteed to be the same size as Lisp objects of type @code{int}, and (as | |
2469 far as I can tell) of size_t (unsigned!) and ssize_t. The only type | |
2470 below that is not an EMACS_INT is Hashcode, which is an unsigned value | |
2471 of the same size as EMACS_INT. | |
2472 | |
2473 @item | |
2474 Type names should be relatively short (no more than 10 characters or | |
2475 so), with the first letter capitalized and no underscores if they can at | |
2476 all be avoided. | |
2477 | |
2478 @item | |
2479 "count" == a zero-based measurement of some quantity. Includes sizes, | |
2480 offsets, and indexes. | |
2481 | |
2482 @item | |
2483 "bpos" == a one-based measurement of a position in a buffer. "Charbpos" | |
2484 and "Bytebpos" count text in the buffer, rather than bytes in memory; | |
2485 thus Bytebpos does not directly correspond to the memory representation. | |
2486 Use "Membpos" for this. | |
2487 | |
2488 @item | |
2489 "Char" refers to internal-format characters, not to the C type "char", | |
2490 which is really a byte. | |
2491 @end itemize | |
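
The "viral" behavior of unsigned quantities mentioned in the first convention can be seen in a few lines of C. The functions below are hypothetical and exist only for illustration; a compiler run with -Wsign-compare would be expected to warn about the mixed comparison.

```c
#include <assert.h>

/* Mixed comparison: the signed operand is converted to unsigned first,
   so -1 turns into the largest unsigned value before the test. */
int demo_less_mixed (int a, unsigned int b) { return a < b; }

/* All-signed comparison: behaves the way arithmetic intuition expects. */
int demo_less_signed (int a, int b) { return a < b; }

/* With signed counts, a sanity check on the result is meaningful. */
int demo_remaining (int size, int used)
{
  int left = size - used;
  assert (left >= 0);
  return left;
}

/* With unsigned counts, the same subtraction silently wraps around and
   no such check can catch it. */
unsigned int demo_remaining_unsigned (unsigned int size, unsigned int used)
{
  return size - used;
}
```

Here @code{demo_less_mixed (-1, 1u)} is false even though -1 is less than 1, and @code{demo_remaining_unsigned (3u, 4u)} yields the largest @code{unsigned int} rather than -1.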
2492 | |
2493 For the actual name changes, see the script below. | |
2494 | |
2495 I ran the following script to do the conversion. (NOTE: This script is | |
2496 idempotent. You can safely run it multiple times and it will not screw | |
2497 up previous results -- in fact, it will do nothing if nothing has | |
2498 changed. Thus, it can be run repeatedly as necessary to handle patches | |
2499 coming in from old workspaces, or old branches.) There are two tags, | |
2500 just before and just after the change: @samp{pre-integral-type-rename} | |
2501 and @samp{post-integral-type-rename}. When merging code from the main | |
2502 trunk into a branch, the best thing to do is first merge up to | |
2503 @samp{pre-integral-type-rename}, then apply the script and associated | |
2504 changes, then merge from @samp{post-integral-type-rename} to the | |
2505 present. (Alternatively, just do the merging in one operation; but you | |
2506 may then have a lot of conflicts needing to be resolved by hand.) | |
2507 | |
2508 Script @samp{fixtypes.sh} follows: | |
2509 | |
2510 @example | |
2511 ----------------------------------- cut ------------------------------------ | |
2512 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]" | |
2513 gr Memory_Count Bytecount $files | |
2514 gr Lstream_Data_Count Bytecount $files | |
2515 gr Element_Count Elemcount $files | |
2516 gr Hash_Code Hashcode $files | |
2517 gr extcount bytecount $files | |
2518 gr bufpos charbpos $files | |
2519 gr bytind bytebpos $files | |
2520 gr memind membpos $files | |
2521 gr bufbyte intbyte $files | |
2522 gr Extcount Bytecount $files | |
2523 gr Bufpos Charbpos $files | |
2524 gr Bytind Bytebpos $files | |
2525 gr Memind Membpos $files | |
2526 gr Bufbyte Intbyte $files | |
2527 gr EXTCOUNT BYTECOUNT $files | |
2528 gr BUFPOS CHARBPOS $files | |
2529 gr BYTIND BYTEBPOS $files | |
2530 gr MEMIND MEMBPOS $files | |
2531 gr BUFBYTE INTBYTE $files | |
2532 gr MEMORY_COUNT BYTECOUNT $files | |
2533 gr LSTREAM_DATA_COUNT BYTECOUNT $files | |
2534 gr ELEMENT_COUNT ELEMCOUNT $files | |
2535 gr HASH_CODE HASHCODE $files | |
2536 ----------------------------------- cut ------------------------------------ | |
2537 @end example | |
2538 | |
2539 The @samp{gr} script, and the scripts it uses, are documented in | |
2540 @file{README.global-renaming}, because if placed in this file they would | |
2541 need to have their @@ characters doubled, meaning you couldn't easily | |
2542 cut and paste from the source. | |
2543 | |
2544 In addition to those programs, I needed to fix up a few other | |
2545 things, particularly relating to the duplicate definitions of | |
2546 types, now that some types merged with others. Specifically: | |
2547 | |
2548 @enumerate | |
2549 @item | |
2550 in @file{lisp.h}, removed duplicate declarations of Bytecount. The changed | |
2551 code should now look like this: (In each code snippet below, the first | |
2552 and last lines are the same as the original, as are all lines outside of | |
2553 those lines. That allows you to locate the section to be replaced, and | |
2554 replace the stuff in that section, verifying that there isn't anything | |
2555 new added that would need to be kept.) | |
2556 | |
2557 @example | |
2558 --------------------------------- snip ------------------------------------- | |
2559 /* Counts of bytes or chars */ | |
2560 typedef EMACS_INT Bytecount; | |
2561 typedef EMACS_INT Charcount; | |
2562 | |
2563 /* Counts of elements */ | |
2564 typedef EMACS_INT Elemcount; | |
2565 | |
2566 /* Hash codes */ | |
2567 typedef unsigned long Hashcode; | |
2568 | |
2569 /* ------------------------ dynamic arrays ------------------- */ | |
2570 --------------------------------- snip ------------------------------------- | |
2571 @end example | |
2572 | |
2573 @item | |
2574 in @file{lstream.h}, removed duplicate declaration of Bytecount. Rewrote the | |
2575 comment about this type. The changed code should now look like this: | |
2576 | |
2577 @example | |
2578 --------------------------------- snip ------------------------------------- | |
2579 #endif | |
2580 | |
2581 /* There have been some arguments over what the type should be that | |
2582 specifies a count of bytes in a data block to be written out or read in, | |
2583 using @code{Lstream_read()}, @code{Lstream_write()}, and related functions. | |
2584 Originally it was long, which worked fine; Martin "corrected" these to | |
2585 size_t and ssize_t on the grounds that this is theoretically cleaner and | |
2586 is in keeping with the C standards. Unfortunately, this practice is | |
2587 horribly error-prone due to design flaws in the way that mixed | |
2588 signed/unsigned arithmetic happens. In fact, by doing this change, | |
2589 Martin introduced a subtle but fatal error that caused the operation of | |
2590 sending large mail messages to the SMTP server under Windows to fail. | |
2591 By putting all values back to be signed, avoiding any signed/unsigned | |
2592 mixing, the bug immediately went away. The type then in use was | |
2593 Lstream_Data_Count, so that it could be reverted cleanly if a vote came to | |
2594 that. Now it is Bytecount. | |
2595 | |
2596 Some earlier comments about why the type must be signed: This MUST BE | |
2597 SIGNED, since it also is used in functions that return the number of | |
2598 bytes actually read to or written from in an operation, and these | |
2599 functions can return -1 to signal error. | |
2600 | |
2601 Note that the standard Unix @code{read()} and @code{write()} functions define the | |
2602 count going in as a size_t, which is UNSIGNED, and the count going | |
2603 out as an ssize_t, which is SIGNED. This is a horrible design | |
2604 flaw. Not only is it highly likely to lead to logic errors when a | |
2605 -1 gets interpreted as a large positive number, but operations are | |
2606 bound to fail in all sorts of horrible ways when a number in the | |
2607 upper-half of the size_t range is passed in -- this number is | |
2608 unrepresentable as an ssize_t, so code that checks to see how many | |
2609 bytes are actually written (which is mandatory if you are dealing | |
2610 with certain types of devices) will get completely screwed up. | |
2611 | |
2612 --ben | |
2613 */ | |
2614 | |
2615 typedef enum lstream_buffering | |
2616 --------------------------------- snip ------------------------------------- | |
2617 @end example | |
2618 | |
2619 @item | |
2620 in @file{dumper.c}, there are four places, all inside of @code{switch()} statements, | |
2621 where XD_BYTECOUNT appears twice as a case tag. In each case, the two | |
2622 case blocks contain identical code, and you should *REMOVE THE SECOND* | |
2623 and leave the first. | |
2624 @end enumerate | |
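
The hazard described in the @file{lstream.h} comment, where a -1 error return gets interpreted as a large positive number, can be sketched as follows. Everything here is hypothetical: @code{demo_bytecount} merely stands in for a signed count type, and @code{demo_write} is a toy function, not an Lstream routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signed byte count. */
typedef long demo_bytecount;

/* Toy writer: returns the number of bytes "written", or -1 on error. */
demo_bytecount demo_write (const char *data, demo_bytecount size, int fail)
{
  (void) data;
  assert (size >= 0);
  return fail ? -1 : size;
}

/* Signed caller: the error return is visible as a negative count. */
int demo_saw_error_signed (int fail)
{
  demo_bytecount n = demo_write ("abc", 3, fail);
  return n < 0;
}

/* Unsigned caller: -1 has become a huge positive count, and the test
   below can never be true (a compiler may warn about exactly that). */
int demo_saw_error_unsigned (int fail)
{
  size_t n = (size_t) demo_write ("abc", 3, fail);
  return n < 0;
}
```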
2625 | |
2626 @node Text/Char Type Renaming, , Great Integral Type Renaming, Major Textual Changes | |
2627 @section Text/Char Type Renaming | |
2628 @cindex Text/Char Type Renaming | |
2629 @cindex type renaming, text/char | |
2630 @cindex renaming, text/char types | |
2631 | |
2632 The purpose of this was | |
2633 | |
2634 @enumerate | |
2635 @item | |
2636 To distinguish between ``charptr'' when it refers to operations on | |
2637 the pointer itself and when it refers to operations on text | |
2638 @item | |
2639 To use consistent naming for everything referring to internal format, i.e. | |
2640 @end enumerate | |
2641 | |
2642 @example | |
2643 Itext == text in internal format | |
2644 Ibyte == a byte in such text | |
2645 Ichar == a char as represented in internal character format | |
2646 @end example | |
2647 | |
2648 Thus e.g. | |
2649 | |
2650 @example | |
2651 set_charptr_emchar -> set_itext_ichar | |
2652 @end example | |
2653 | |
2654 This was done using a script like this: | |
2655 | |
2656 @example | |
2657 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]" | |
2658 gr Intbyte Ibyte $files | |
2659 gr INTBYTE IBYTE $files | |
2660 gr intbyte ibyte $files | |
2661 gr EMCHAR ICHAR $files | |
2662 gr emchar ichar $files | |
2663 gr Emchar Ichar $files | |
2664 gr INC_CHARPTR INC_IBYTEPTR $files | |
2665 gr DEC_CHARPTR DEC_IBYTEPTR $files | |
2666 gr VALIDATE_CHARPTR VALIDATE_IBYTEPTR $files | |
2667 gr valid_charptr valid_ibyteptr $files | |
2668 gr CHARPTR ITEXT $files | |
2669 gr charptr itext $files | |
2670 gr Charptr Itext $files | |
2671 @end example | |
2672 | |
2673 See @file{README.global-renaming}, mentioned above, for the source to @samp{gr}. | |
2674 | |
2675 As in the integral-types change, there are pre and post tags before and | |
2676 after the change: | |
2677 | |
2678 @example | |
2679 pre-internal-format-textual-renaming | |
2680 post-internal-format-textual-renaming | |
2681 @end example | |
2682 | |
2683 When merging a large branch, follow the same sort of procedure | |
2684 documented above, using these tags -- essentially sync up to the pre | |
2685 tag, then apply the script yourself, then sync from the post tag to the | |
2686 present. You can probably do the same if you don't have a separate | |
2687 workspace, but do have lots of outstanding changes and you'd rather not | |
2688 just merge all the textual changes directly. Use something like this: | |
2689 | |
2690 (WARNING: I'm not a CVS guru; before trying this, or any large operation | |
2691 that might potentially mess things up, @strong{DEFINITELY} make a backup of | |
2692 your existing workspace.) | |
2693 | |
2694 @example | |
2695 cup -r pre-internal-format-textual-renaming | |
2696 <apply script> | |
2697 cup -A -j post-internal-format-textual-renaming -j HEAD | |
2698 @end example | |
2699 | |
2700 This might also work: | |
2701 | |
2702 @example | |
2703 cup -j pre-internal-format-textual-renaming | |
2704 <apply script> | |
2705 cup -j post-internal-format-textual-renaming -j HEAD | |
2706 @end example | |
2707 | |
2708 ben | |
2709 | |
2710 The following is a script to go in the opposite direction: | |
2711 | |
2712 @example | |
2713 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]" | |
2714 | |
2715 # Evidently Perl considers _ to be a word char ala \b, even though XEmacs | |
2716 # doesn't. We need to be careful here with ibyte/ichar because of words | |
2717 # like Richard, @code{eicharlen()}, multibyte, HIBYTE, etc. | |
2718 | |
2719 gr Ibyte Intbyte $files | |
2720 gr '\bIBYTE' INTBYTE $files | |
2721 gr '\bibyte' intbyte $files | |
2722 gr '\bICHAR' EMCHAR $files | |
2723 gr '\bichar' emchar $files | |
2724 gr '\bIchar' Emchar $files | |
2725 gr '\bIBYTEPTR' CHARPTR $files | |
2726 gr '\bibyteptr' charptr $files | |
2727 gr '\bITEXT' CHARPTR $files | |
2728 gr '\bitext' charptr $files | |
2729 gr '\bItext' CHARPTR $files | |
2730 | |
2731 gr '_IBYTE' _INTBYTE $files | |
2732 gr '_ibyte' _intbyte $files | |
2733 gr '_ICHAR' _EMCHAR $files | |
2734 gr '_ichar' _emchar $files | |
2735 gr '_Ichar' _Emchar $files | |
2736 gr '_IBYTEPTR' _CHARPTR $files | |
2737 gr '_ibyteptr' _charptr $files | |
2738 gr '_ITEXT' _CHARPTR $files | |
2739 gr '_itext' _charptr $files | |
2740 gr '_Itext' _CHARPTR $files | |
2741 @end example | |
2742 | |
2743 @node Rules When Writing New C Code, Regression Testing XEmacs, Major Textual Changes, Top | |
2744 @chapter Rules When Writing New C Code | |
2745 @cindex writing new C code, rules when | |
2746 @cindex C code, rules when writing new | |
2747 @cindex code, rules when writing new C | |
2748 | |
2749 The XEmacs C code is extremely complex and intricate, and there are many | |
2750 rules that are more or less consistently followed throughout the code. | |
2751 Many of these rules are not obvious, so they are explained here. It is | |
2752 of the utmost importance that you follow them. If you don't, you may | |
2753 get something that appears to work, but which will crash in odd | |
2754 situations, often in code far away from where the actual breakage is. | |
2755 | |
2756 @menu | |
2757 * A Reader's Guide to XEmacs Coding Conventions:: | |
2758 * General Coding Rules:: | |
2759 * Object-Oriented Techniques for C:: | |
2760 * Writing Lisp Primitives:: | |
2761 * Writing Good Comments:: | |
2762 * Adding Global Lisp Variables:: | |
2763 * Writing Macros:: | |
2764 * Proper Use of Unsigned Types:: | |
2765 * Techniques for XEmacs Developers:: | |
2766 @end menu | |
2767 | |
2768 See also @ref{Coding for Mule}. | |
2769 | |
2770 @node A Reader's Guide to XEmacs Coding Conventions, General Coding Rules, Rules When Writing New C Code, Rules When Writing New C Code | |
2771 @section A Reader's Guide to XEmacs Coding Conventions | |
2772 @cindex coding conventions | |
2773 @cindex reader's guide | |
2774 @cindex coding rules, naming | |
2775 | |
2776 Of course the low-level implementation language of XEmacs is C, but much | |
2777 of that uses the Lisp engine to do its work. However, because the code | |
2778 is ``inside'' of the protective containment shell around the ``reactor | |
2779 core,'' you'll see lots of complex ``plumbing'' needed to do the work | |
2780 and ``safety mechanisms,'' whose failure results in a meltdown. This | |
2781 section provides a quick overview (or review) of the various components | |
2782 of the implementation of Lisp objects. | |
2783 | |
2784 Two typographic conventions help to identify C objects that implement | |
2785 Lisp objects. The first is that capitalized identifiers, especially | |
2786 beginning with the letters @samp{Q}, @samp{V}, @samp{F}, and @samp{S}, | |
2787 for C variables and functions, and C macros beginning with the | |
2788 letter @samp{X}, are used to implement Lisp. The second is that where | |
2789 Lisp uses the hyphen @samp{-} in symbol names, the corresponding C | |
2790 identifiers use the underscore @samp{_}. Of course, since XEmacs Lisp | |
2791 contains interfaces to many external libraries, those external names | |
2792 will follow the coding conventions their authors chose, and may overlap | |
2793 the ``XEmacs name space.'' However these cases are usually pretty | |
2794 obvious. | |
2795 | |
2796 All Lisp objects are handled indirectly. The @code{Lisp_Object} | |
2797 type is usually a pointer to a structure, except for a very small number | |
2798 of types with immediate representations (currently characters and | |
2799 integers). However, these types cannot be directly operated on in C | |
2800 code, either, so they can also be considered indirect. Types that do | |
2801 not have an immediate representation always have a C typedef | |
2802 @code{Lisp_@var{type}} for a corresponding structure. | |
2803 @c #### mention l(c)records here? | |
2804 | |
2805 In older code, it was common practice to pass around pointers to | |
2806 @code{Lisp_@var{type}}, but this is now deprecated in favor of using | |
2807 @code{Lisp_Object} for all function arguments and return values that are | |
2808 Lisp objects. The @code{X@var{type}} macro is used to extract the | |
2809 pointer and cast it to @code{(Lisp_@var{type} *)} for the desired type. | |
2810 | |
2811 @strong{Convention}: macros whose names begin with @samp{X} operate on | |
2812 @code{Lisp_Object}s and do no type-checking. Many such macros are type | |
2813 extractors, but others implement Lisp operations in C (@emph{e.g.}, | |
2814 @code{XCAR} implements the Lisp @code{car} function). These are unsafe, | |
2815 and must only be used where types of all data have already been checked. | |
2816 Such macros are only applied to @code{Lisp_Object}s. In internal | |
2817 implementations where the pointer has already been converted, the | |
2818 structure is operated on directly using the C @code{->} member access | |
2819 operator. | |
2820 | |
2821 The @code{@var{type}P}, @code{CHECK_@var{type}}, and | |
2822 @code{CONCHECK_@var{type}} macros are used to test types. The first | |
2823 returns a Boolean value, and the latter two signal errors. (The | |
2824 @samp{CONCHECK} variety allows execution to be CONtinued under some | |
2825 circumstances, thus the name.) Functions which expect to be passed user | |
2826 data invariably call @samp{CHECK} macros on arguments. | |
2827 | |
2828 There are many types of specialized Lisp objects implemented in C, but | |
2829 the most pervasive type is the @dfn{symbol}. Symbols are used as | |
2830 identifiers, variables, and functions. | |
2831 | |
2832 @strong{Convention}: Global variables whose names begin with @samp{Q} | |
2833 are constants whose value is a symbol. The name of the variable should | |
2834 be derived from the name of the symbol using the same rules as for Lisp | |
2835 primitives. Such variables allow the C code to check whether a | |
2836 particular @code{Lisp_Object} is equal to a given symbol. Symbols are | |
2837 Lisp objects, so these variables may be passed to Lisp primitives. (An | |
2838 alternative to the use of @samp{Q...} variables is to call the | |
2839 @code{intern} function at initialization in the | |
2840 @code{vars_of_@var{module}} function, which is hardly less efficient.) | |
2841 | |
2842 @strong{Convention}: Global variables whose names begin with @samp{V} | |
2843 are variables that contain Lisp objects. The convention here is that | |
2844 all global variables of type @code{Lisp_Object} begin with @samp{V}, and | |
2845 no others do (not even integer and boolean variables that have Lisp | |
2846 equivalents). Most of the time, these variables have equivalents in | |
2847 Lisp, which are defined via the @samp{DEFVAR} family of macros, but some | |
2848 don't. Since the variable's value is a @code{Lisp_Object}, it can be | |
2849 passed to Lisp primitives. | |
2850 | |
2851 The implementation of Lisp primitives is more complex. | |
2852 @strong{Convention}: Global variables with names beginning with @samp{S} | |
2853 contain a structure that allows the Lisp engine to identify and call a C | |
2854 function. In modern versions of XEmacs, these identifiers are almost | |
2855 always completely hidden in the @code{DEFUN} and @code{SUBR} macros, but | |
2856 you will encounter them if you look at very old versions of XEmacs or at | |
2857 GNU Emacs. @strong{Convention}: Functions with names beginning with | |
2858 @samp{F} implement Lisp primitives. Of course all their arguments and | |
2859 their return values must be @code{Lisp_Object}s. (This is hidden in the | |
2860 @code{DEFUN} macro.) | |
2861 | |
2862 | |
2863 @node General Coding Rules, Object-Oriented Techniques for C, A Reader's Guide to XEmacs Coding Conventions, Rules When Writing New C Code | |
2864 @section General Coding Rules | |
2865 @cindex coding rules, general | |
2866 | |
2867 The C code is actually written in a dialect of C called @dfn{Clean C}, | |
2868 meaning that it can be compiled, mostly warning-free, with either a C or | |
2869 C++ compiler. Coding in Clean C has several advantages over plain C. | |
2870 C++ compilers are more nit-picking, and a number of coding errors have | |
2871 been found by compiling with C++. The ability to use both C and C++ | |
2872 tools means that a greater variety of development tools is available to | |
2873 the developer. In addition, the ability to overload operators in C++ | |
2874 means it is possible, for error-checking purposes, to redefine certain | |
2875 simple types (normally defined as aliases for simple built-in types such | |
2876 as @code{unsigned char} or @code{long}) as classes, strictly limiting the permissible | |
2877 operations and catching illegal implicit casts and such. | |
2878 | |
2879 Every module includes @file{<config.h>} (angle brackets so that | |
2880 @samp{--srcdir} works correctly; @file{config.h} may or may not be in | |
2881 the same directory as the C sources) and @file{lisp.h}. @file{config.h} | |
2882 must always be included before any other header files (including | |
2883 system header files) to ensure that certain tricks played by various | |
2884 @file{s/} and @file{m/} files work out correctly. | |
2885 | |
2886 When including header files, always use angle brackets, not double | |
2887 quotes, except when the file to be included is always in the same | |
2888 directory as the including file. If either file is a generated file, | |
2889 then that is not likely to be the case. In order to understand why we | |
2890 have this rule, imagine what happens when you do a build in the source | |
2891 directory using @samp{./configure} and another build in another | |
2892 directory using @samp{../work/configure}. There will be two different | |
2893 @file{config.h} files. Which one will be used if you @samp{#include | |
2894 "config.h"}? | |
2895 | |
2896 Almost every module contains a @code{syms_of_*()} function and a | |
2897 @code{vars_of_*()} function. The former declares any Lisp primitives | |
2898 you have defined and defines any symbols you will be using. The latter | |
2899 declares any global Lisp variables you have added and initializes global | |
2900 C variables in the module. @strong{Important}: There are stringent | |
2901 requirements on exactly what can go into these functions. See the | |
2902 comment in @file{emacs.c}. The reason for this is to avoid obscure | |
2903 unwanted interactions during initialization. If you don't follow these | |
2904 rules, you'll be sorry! If you want to do anything that isn't allowed, | |
2905 create a @code{complex_vars_of_*()} function for it. Doing this is | |
2906 tricky, though: you have to make sure your function is called at the | |
2907 right time so that all the initialization dependencies work out. | |
2908 | |
2909 Declare each function of these kinds in @file{symsinit.h}. Make sure | |
2910 it's called in the appropriate place in @file{emacs.c}. You never need | |
2911 to include @file{symsinit.h} directly, because it is included by | |
2912 @file{lisp.h}. | |
2913 | |
2914 @strong{All global and static variables that are to be modifiable must | |
2915 be declared uninitialized.} This means that you may not use the | |
2916 ``declare with initializer'' form for these variables, such as @code{int | |
2917 some_variable = 0;}. The reason for this has to do with some kludges | |
2918 done during the dumping process: If possible, the initialized data | |
2919 segment is re-mapped so that it becomes part of the (unmodifiable) code | |
2920 segment in the dumped executable. This allows this memory to be shared | |
2921 among multiple running XEmacs processes. XEmacs is careful to place as | |
2922 much constant data as possible into initialized variables during the | |
2923 @file{temacs} phase. | |
2924 | |
2925 @cindex copy-on-write | |
2926 @strong{Please note:} This kludge only works on a few systems nowadays, | |
2927 and is rapidly becoming irrelevant because most modern operating systems | |
2928 provide @dfn{copy-on-write} semantics. All data is initially shared | |
2929 between processes, and a private copy is automatically made (on a | |
2930 page-by-page basis) when a process first attempts to write to a page of | |
2931 memory. | |
2932 | |
2933 Formerly, there was a requirement that static variables not be declared | |
2934 inside of functions. This had to do with another hack along the same | |
2935 vein as what was just described: old USG systems put statically-declared | |
2936 variables in the initialized data space, so those header files had a | |
2937 @code{#define static} declaration. (That way, the data-segment remapping | |
2938 described above could still work.) This fails badly on static variables | |
2939 inside of functions, which suddenly become automatic variables; | |
2940 therefore, you weren't supposed to have any of them. This awful kludge | |
2941 has been removed in XEmacs because | |
2942 | |
2943 @enumerate | |
2944 @item | |
2945 almost all of the systems that used this kludge ended up having | |
2946 to disable the data-segment remapping anyway; | |
2947 @item | |
2948 the only systems that didn't were extremely outdated ones; | |
2949 @item | |
2950 this hack completely messed up inline functions. | |
2951 @end enumerate | |
2952 | |
2953 The C source code makes heavy use of C preprocessor macros. One popular | |
2954 macro style is: | |
2955 | |
2956 @example | |
2957 #define FOO(var, value) do @{ \ | |
2958 Lisp_Object FOO_value = (value); \ | |
2959 ... /* compute using FOO_value */ \ | |
2960 (var) = bar; \ | |
2961 @} while (0) | |
2962 @end example | |
2963 | |
2964 The @code{do @{...@} while (0)} is a standard trick to allow @code{FOO} to have | |
2965 statement semantics, so that it can safely be used within an @code{if} | |
2966 statement in C, for example. Multiple evaluation is prevented by | |
2967 copying a supplied argument into a local variable, so that | |
2968 @code{FOO(var,fun(1))} only calls @code{fun} once. | |
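The two properties described above can be checked with a small stand-alone sketch. The macro and function names here (@code{STORE_MAX}, @code{STORE_MAX_UNSAFE}, @code{fun}) are hypothetical and exist only for this illustration; they are not taken from the XEmacs sources.

```c
#include <assert.h>

static int fun_calls;           /* counts how often fun() has run */

static int
fun (int x)
{
  fun_calls++;
  return x * 2;
}

/* Unsafe variant: VALUE appears twice in the expansion, so an argument
   with side effects is evaluated twice. */
#define STORE_MAX_UNSAFE(var, value) \
  ((var) = ((value) > 10 ? (value) : 10))

/* Safe variant: do/while (0) gives statement semantics, and the local
   copy guarantees that VALUE is evaluated exactly once. */
#define STORE_MAX(var, value) do {                      \
  int STORE_MAX_value = (value);                        \
  (var) = STORE_MAX_value > 10 ? STORE_MAX_value : 10;  \
} while (0)

/* Returns the number of times fun() ran when called through STORE_MAX. */
static int
count_calls_safe (void)
{
  int result = 0;

  fun_calls = 0;
  /* The macro sits between `if' and `else' without extra braces. */
  if (result == 0)
    STORE_MAX (result, fun (7));
  else
    result = -1;
  assert (result == 14);
  return fun_calls;
}

/* Returns the number of times fun() ran through the unsafe variant. */
static int
count_calls_unsafe (void)
{
  int result = 0;

  fun_calls = 0;
  STORE_MAX_UNSAFE (result, fun (7));
  return fun_calls;
}
```

Had @code{STORE_MAX} been written with bare braces instead of @code{do ... while (0)}, the semicolon after the macro call would end the @code{if} statement and the following @code{else} would be a syntax error.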
2969 | |
2970 Lisp lists are popular data structures in the C code as well as in | |
2971 Elisp. There are two sets of macros that iterate over lists. | |
2972 @code{EXTERNAL_LIST_LOOP_@var{n}} should be used when the list has been | |
2973 supplied by the user, and cannot be trusted to be acyclic and | |
2974 @code{nil}-terminated. A @code{malformed-list} or @code{circular-list} error | |
2975 will be generated if the list being iterated over is not entirely | |
2976 kosher. @code{LIST_LOOP_@var{n}}, on the other hand, is faster and less | |
2977 safe, and can be used only on trusted lists. | |
2978 | |
2979 Related macros are @code{GET_EXTERNAL_LIST_LENGTH} and | |
2980 @code{GET_LIST_LENGTH}, which calculate the length of a list; | |
2981 @code{GET_EXTERNAL_LIST_LENGTH} also validates that the list is | |
2982 proper. The macros @code{EXTERNAL_LIST_LOOP_DELETE_IF} and | |
2983 @code{LIST_LOOP_DELETE_IF} delete the elements of a Lisp list that satisfy some | |
2984 predicate. | |
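To see why the trusted and untrusted variants differ in cost and safety, here is a toy sketch on a plain C linked list. It is an assumption-laden illustration, not the XEmacs implementation: @code{struct cell}, @code{trusted_length} and @code{checked_length} are hypothetical names, @code{NULL} stands in for @code{nil}, and the cycle check uses the textbook tortoise-and-hare technique. Dotted (improperly terminated) lists are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cons cell; NULL plays the role of nil. */
struct cell
{
  int value;
  struct cell *next;
};

/* Trusted walk: fast, but loops forever on a circular list. */
static long
trusted_length (const struct cell *list)
{
  long n = 0;

  for (; list != NULL; list = list->next)
    n++;
  return n;
}

/* Untrusted walk: the hare advances every step, the tortoise every
   second step.  If the hare ever lands on the tortoise, the list is
   circular and -1 is returned instead of a length. */
static long
checked_length (const struct cell *list)
{
  const struct cell *hare = list;
  const struct cell *tortoise = list;
  long n = 0;

  while (hare != NULL)
    {
      hare = hare->next;
      n++;
      if (n % 2 == 0)
        tortoise = tortoise->next;
      if (hare != NULL && hare == tortoise)
        return -1;              /* circular list detected */
    }
  return n;
}
```

The checked walk does extra comparisons on every step, which is the kind of overhead that makes a separate, faster loop worthwhile for lists the C code built itself.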
2985 | |
2986 @node Object-Oriented Techniques for C, Writing Lisp Primitives, General Coding Rules, Rules When Writing New C Code | |
2987 @section Object-Oriented Techniques for C | |
2988 @cindex coding rules, object-oriented | |
2989 @cindex object-oriented techniques | |
2990 | |
2991 At the lowest levels, XEmacs makes heavy use of object-oriented | |
2992 techniques to promote code-sharing and uniform interfaces for different | |
2993 devices and platforms. Commonly, but not always, such objects are | |
2994 ``wrapped'' and exported to Lisp as Lisp objects. Usually they use | |
2995 the internal structures developed for Lisp objects (the @samp{lrecord} | |
2996 structure) in order to take advantage of Lisp memory management. | |
2997 Unfortunately, XEmacs was originally written in C, so these techniques | |
2998 are based on heavy use of C macros. | |
2999 | |
3000 @c You can't use @var{} for type below, because case is important. | |
3001 A module defining a class is likely to use most of the following | |
3002 declarations and macros. In the following, the notation @samp{<type>} | |
3003 will stand for the full name of the class, and will be capitalized in | |
3004 the way normal for its context. The notation @samp{<typ>} will stand | |
3005 for the abbreviated form commonly used in macro names, while @samp{ty} | |
3006 will be used as the typical name for instances of the class. (See the | |
3007 entry for @samp{MAYBE_<TYP>METH} below for an example using all three | |
3008 notations.) | |
3009 | |
3010 In the interface (@file{.h} file), the following declarations are used | |
3011 often. Others may be used in particular modules. Since they're | |
3012 quite short in most cases, the definitions are given as well. The | |
3013 generic macros used are defined in @file{lisp.h} or @file{lrecord.h}. | |
3014 | |
3015 @c #### reorganize this table into stuff used in general code, and stuff | |
3016 @c used only in declarations or initializations | |
3017 @table @samp | |
3018 @c #### declaration | |
3019 @item typedef struct Lisp_<Type> Lisp_<Type> | |
3020 This refers to the internal structure used by C code. The XEmacs coding | |
3021 style now forbids passing pointers to @samp{Lisp_<Type>} structures into | |
3022 or out of a function; instead, a @samp{Lisp_Object} should be passed or | |
3023 returned (created using @samp{wrap_<type>}, if necessary). | |
3024 | |
3025 @c #### declaration | |
3026 @item DECLARE_LRECORD (<type>, Lisp_<Type>) | |
3027 Declares an @samp{lrecord} for @samp{<Type>}, which is the unit of | |
3028 allocation. | |
3029 | |
3030 @item #define X<TYPE>(x) XRECORD (x, <type>, Lisp_<Type>) | |
3031 Turns a @code{Lisp_Object} into a pointer to @samp{struct Lisp_<Type>}. | |
3032 | |
3033 @item #define wrap_<type>(p) wrap_record (p, <type>) | |
3034 Turns a pointer to @samp{struct Lisp_<Type>} into a @code{Lisp_Object}. | |
3035 | |
3036 @item #define <TYPE>P(x) RECORDP (x, <type>) | |
3037 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>}. | |
3038 Returns a C int, not a Lisp Boolean value. | |
3039 | |
3040 @item #define CHECK_<TYPE>(x) CHECK_RECORD (x, <type>) | |
3041 @itemx #define CONCHECK_<TYPE>(x) CONCHECK_RECORD (x, <type>) | |
3042 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>}, | |
3043 and signals a Lisp error if not. The @samp{CHECK} version of the macro | |
3044 never returns if the type is wrong, while the @samp{CONCHECK} version | |
3045 can return if the user catches it in the debugger and explicitly | |
3046 requests a return. | |
3047 | |
3048 @item #define RAW_<TYP>METH(ty, m) ((ty)->methods->m##_method) | |
3049 Return a function pointer for the method for an object @var{TY} of class | |
3050 @samp{Lisp_<Type>}, or @samp{NULL} if there is none for this type. | |
3051 | |
3052 @item #define HAS_<TYP>METH_P(ty, m) (!!RAW_<TYP>METH (ty, m)) | |
3053 Test whether the class that @var{TY} is an instance of has the method. | |
3054 | |
3055 @item #define <TYP>METH(ty, m, args) ((RAW_<TYP>METH (ty, m)) args) | |
3056 Call the method on @samp{args}. @samp{args} must be enclosed in | |
3057 parentheses in the call. It is the programmer's responsibility to | |
3058 ensure that the method is available. The standard convenience macro | |
3059 @samp{MAYBE_<TYP>METH} is often provided for the common case where a | |
3060 void-returning method of @samp{<Type>} is called. | |
3061 | |
3062 @item #define MAYBE_<TYP>METH(ty, m, args) do @{ ... @} while (0) | |
3063 Call a void-returning @samp{<Type>} method, if it exists. Note the use | |
3064 of the @samp{do ... while (0)} idiom to give the macro call C statement | |
3065 semantics. The full definition is equally idiomatic: | |
3066 | |
3067 @example | |
3068 #define MAYBE_<TYP>METH(ty, m, args) do @{ \ | |
3069 Lisp_<Type> *maybe_<typ>meth_ty = (ty); \ | |
3070 if (HAS_<TYP>METH_P (maybe_<typ>meth_ty, m)) \ | |
3071 <TYP>METH (maybe_<typ>meth_ty, m, args); \ | |
3072 @} while (0) | |
3073 @end example | |
3074 @end table | |
3075 | |
3076 The use of macros for invoking an object's methods makes life a bit | |
3077 difficult for the student or maintainer when browsing the code. In | |
3078 particular, calls are of the form @samp{<TYP>METH (ty, some_method, (x, | |
3079 y))}, but definitions typically are for @samp{<subtype>_some_method}. | |
3080 Thus, when you are trying to find calls, you need to grep for | |
3081 @samp{some_method}, but this will also catch calls and definitions of | |
3082 that method for instances of other subtypes of @samp{<Type>}, and there | |
3083 may be a rather large number of them. | |
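The method-call macros can be tried out in isolation with a toy class. Everything named here (@code{struct toy_console}, the @samp{CON} abbreviation, @code{tty_beep} and so on) is hypothetical and only mirrors the shape of the definitions in the table above; real XEmacs classes carry an lrecord header and many more methods.

```c
#include <assert.h>
#include <stddef.h>

struct toy_console;

/* Per-class method table; a NULL slot means "no such method". */
struct toy_console_methods
{
  const char *name;
  int (*width_method) (struct toy_console *con);
  void (*beep_method) (struct toy_console *con, int volume);
};

/* An instance points at the method table of its class. */
struct toy_console
{
  const struct toy_console_methods *methods;
  int beeps;
};

#define RAW_CONMETH(con, m) ((con)->methods->m##_method)
#define HAS_CONMETH_P(con, m) (!!RAW_CONMETH (con, m))
#define CONMETH(con, m, args) ((RAW_CONMETH (con, m)) args)
#define MAYBE_CONMETH(con, m, args) do {                \
  struct toy_console *maybe_conmeth_con = (con);        \
  if (HAS_CONMETH_P (maybe_conmeth_con, m))             \
    CONMETH (maybe_conmeth_con, m, args);               \
} while (0)

/* Definitions follow the <subtype>_<method> naming pattern. */
static int
tty_width (struct toy_console *con)
{
  (void) con;
  return 80;
}

static void
tty_beep (struct toy_console *con, int volume)
{
  con->beeps += volume;
}

/* The "tty" class has both methods; the "stream" class has no beep. */
static const struct toy_console_methods tty_methods =
  { "tty", tty_width, tty_beep };
static const struct toy_console_methods stream_methods =
  { "stream", tty_width, NULL };

/* Beep a fresh console of the given class, if the class can beep. */
static int
demo_beeps (const struct toy_console_methods *methods)
{
  struct toy_console con = { methods, 0 };

  MAYBE_CONMETH (&con, beep, (&con, 3));
  return con.beeps;
}
```

Note that the call site mentions only @code{beep}, while the definition is named @code{tty_beep} and the structure slot @code{beep_method}; this is the mismatch that makes grepping for callers and definitions awkward.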
3084 | |
3085 | |
3086 @node Writing Lisp Primitives, Writing Good Comments, Object-Oriented Techniques for C, Rules When Writing New C Code | |
3087 @section Writing Lisp Primitives | |
3088 @cindex writing Lisp primitives | |
3089 @cindex Lisp primitives, writing | |
3090 @cindex primitives, writing Lisp | |
3091 | |
3092 Lisp primitives are Lisp functions implemented in C. The details of | |
3093 interfacing the C function so that Lisp can call it are handled by a few | |
3094 C macros. The only way to really understand how to write new C code is | |
3095 to read the source, but we can explain some things here. | |
3096 | |
3097 An example of a special form is the definition of @code{prog1}, from | |
3098 @file{eval.c}. (An ordinary function would have the same general | |
3099 appearance.) | |
3100 | |
3101 @cindex garbage collection protection | |
3102 @smallexample | |
3103 @group | |
3104 DEFUN ("prog1", Fprog1, 1, UNEVALLED, 0, /* | |
3105 Similar to `progn', but the value of the first form is returned. | |
3106 \(prog1 FIRST BODY...): All the arguments are evaluated sequentially. | |
3107 The value of FIRST is saved during evaluation of the remaining args, | |
3108 whose values are discarded. | |
3109 */ | |
3110 (args)) | |
3111 @{ | |
3112 /* This function can GC */ | |
3113 REGISTER Lisp_Object val, form, tail; | |
3114 struct gcpro gcpro1; | |
3115 | |
3116 val = Feval (XCAR (args)); | |
3117 | |
3118 GCPRO1 (val); | |
3119 | |
3120 LIST_LOOP_3 (form, XCDR (args), tail) | |
3121 Feval (form); | |
3122 | |
3123 UNGCPRO; | |
3124 return val; | |
3125 @} | |
3126 @end group | |
3127 @end smallexample | |
3128 | |
3129 Let's start with a precise explanation of the arguments to the | |
3130 @code{DEFUN} macro. Here is a template for them: | |
3131 | |
3132 @example | |
3133 @group | |
3134 DEFUN (@var{lname}, @var{fname}, @var{min_args}, @var{max_args}, @var{interactive}, /* | |
3135 @var{docstring} | |
3136 */ | |
3137 (@var{arglist})) | |
3138 @end group | |
3139 @end example | |
3140 | |
3141 @table @var | |
3142 @item lname | |
3143 This string is the name of the Lisp symbol to define as the function | |
3144 name; in the example above, it is @code{"prog1"}. | |
3145 | |
3146 @item fname | |
3147 This is the C function name for this function. This is the name that is | |
3148 used in C code for calling the function. The name is, by convention, | |
3149 @samp{F} prepended to the Lisp name, with all dashes (@samp{-}) in the | |
3150 Lisp name changed to underscores. Thus, to call this function from C | |
3151 code, call @code{Fprog1}. Remember that the arguments are of type | |
3152 @code{Lisp_Object}; various macros and functions for creating values of | |
3153 type @code{Lisp_Object} are declared in the file @file{lisp.h}. | |
3154 | |
3155 Primitives whose names are special characters (e.g. @code{+} or | |
3156 @code{<}) are named by spelling out, in some fashion, the special | |
3157 character: e.g. @code{Fplus()} or @code{Flss()}. Primitives whose names | |
3158 begin with normal alphanumeric characters but also contain special | |
3159 characters are spelled out in some creative way, e.g. @code{let*} | |
3160 becomes @code{FletX()}. | |
3161 | |
3162 Each function also has an associated structure that holds the data for | |
3163 the subr object that represents the function in Lisp. This structure | |
3164 conveys the Lisp symbol name to the initialization routine that will | |
3165 create the symbol and store the subr object as its definition. The C | |
3166 variable name of this structure is always @samp{S} prepended to the | |
3167 @var{fname}. You hardly ever need to be aware of the existence of this | |
3168 structure, since @code{DEFUN} plus @code{DEFSUBR} takes care of all the | |
3169 details. | |
3170 | |
3171 @item min_args | |
3172 This is the minimum number of arguments that the function requires. The | |
3173 function @code{prog1} allows a minimum of one argument. | |
3174 | |
3175 @item max_args | |
3176 This is the maximum number of arguments that the function accepts, if | |
3177 there is a fixed maximum. Alternatively, it can be @code{UNEVALLED}, | |
3178 indicating a special form that receives unevaluated arguments, or | |
3179 @code{MANY}, indicating an unlimited number of evaluated arguments (the | |
3180 C equivalent of @code{&rest}). Both @code{UNEVALLED} and @code{MANY} | |
3181 are macros. If @var{max_args} is a number, it may not be less than | |
3182 @var{min_args} and it may not be greater than 8. (If you need to add a | |
3183 function with more than 8 arguments, use the @code{MANY} form. Resist | |
3184 the urge to edit the definition of @code{DEFUN} in @file{lisp.h}. If | |
3185 you do it anyway, make sure to also add another clause to the switch | |
3186 statement in @code{primitive_funcall()}.) | |
3187 | |
3188 @item interactive | |
3189 This is an interactive specification, a string such as might be used as | |
3190 the argument of @code{interactive} in a Lisp function. In the case of | |
3191 @code{prog1}, it is 0 (a null pointer), indicating that @code{prog1} | |
3192 cannot be called interactively. A value of @code{""} indicates a | |
3193 function that should receive no arguments when called interactively. | |
3194 | |
3195 @item docstring | |
3196 This is the documentation string. It is written just like a | |
3197 documentation string for a function defined in Lisp; in particular, the | |
3198 first line should be a single sentence. Note how the documentation | |
3199 string is enclosed in a comment, none of the documentation is placed on | |
3200 the same lines as the comment-start and comment-end characters, and the | |
3201 comment-start characters are on the same line as the interactive | |
3202 specification. @file{make-docfile}, which scans the C files for | |
3203 documentation strings, is very particular about what it looks for, and | |
3204 will not properly extract the doc string if it's not in this exact format. | |
3205 | |
3206 In order to make both @file{etags} and @file{make-docfile} happy, make | |
3207 sure that the @code{DEFUN} line contains the @var{lname} and | |
3208 @var{fname}, and that the comment-start characters for the doc string | |
3209 are on the same line as the interactive specification, and put a newline | |
3210 directly after them (and before the comment-end characters). | |
3211 | |
3212 @item arglist | |
3213 This is the comma-separated list of arguments to the C function. For a | |
3214 function with a fixed maximum number of arguments, provide a C argument | |
3215 for each Lisp argument. In this case, unlike regular C functions, the | |
3216 types of the arguments are not declared; they are simply always of type | |
3217 @code{Lisp_Object}. | |
3218 | |
3219 The names of the C arguments will be used as the names of the arguments | |
3220 to the Lisp primitive as displayed in its documentation, modulo the same | |
3221 concerns described above for @code{F...} names (in particular, | |
3222 underscores in the C arguments become dashes in the Lisp arguments). | |
3223 | |
3224 There is one additional kludge: A trailing @samp{_} on the C argument is | |
3225 discarded when forming the Lisp argument. This allows C language | |
3226 reserved words (like @code{default}) or global symbols (like | |
3227 @code{dirname}) to be used as argument names without compiler warnings | |
3228 or errors. | |
3229 | |
3230 A Lisp function with @w{@var{max_args} = @code{UNEVALLED}} is a | |
3231 @w{@dfn{special form}}; its arguments are not evaluated. Instead it | |
3232 receives one argument of type @code{Lisp_Object}, a (Lisp) list of the | |
3233 unevaluated arguments, conventionally named @code{(args)}. | |
3234 | |
3235 When a Lisp function has no upper limit on the number of arguments, | |
3236 specify @w{@var{max_args} = @code{MANY}}. In this case its implementation in | |
3237 C actually receives exactly two arguments: the number of Lisp arguments | |
3238 (an @code{int}) and the address of a block containing their values (a | |
3239 @w{@code{Lisp_Object *}}). In this case only are the C types specified | |
3240 in the @var{arglist}: @w{@code{(int nargs, Lisp_Object *args)}}. | |
3241 | |
3242 @end table | |
3243 | |
3244 Within the function @code{Fprog1} itself, note the use of the macros | |
3245 @code{GCPRO1} and @code{UNGCPRO}. @code{GCPRO1} is used to ``protect'' | |
3246 a variable from garbage collection---to inform the garbage collector | |
3247 that it must look in that variable and regard the object pointed at by | |
3248 its contents as an accessible object. This is necessary whenever you | |
3249 call @code{Feval} or anything that can directly or indirectly call | |
3250 @code{Feval} (this includes the @code{QUIT} macro!). At such a time, | |
3251 any Lisp object that you intend to refer to again must be protected | |
3252 somehow. @code{UNGCPRO} cancels the protection of the variables that | |
3253 are protected in the current function. It is necessary to do this | |
3254 explicitly. | |
3255 | |
3256 The macro @code{GCPRO1} protects just one local variable. If you want | |
3257 to protect two, use @code{GCPRO2} instead; repeating @code{GCPRO1} will | |
3258 not work. Macros @code{GCPRO3} and @code{GCPRO4} also exist. | |
3259 | |
3260 These macros implicitly use local variables such as @code{gcpro1}; you | |
3261 must declare these explicitly, with type @code{struct gcpro}. Thus, if | |
3262 you use @code{GCPRO2}, you must declare @code{gcpro1} and @code{gcpro2}. | |
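A toy model may help explain why the @code{gcpro@var{n}} variables must be declared and why repeating @code{GCPRO1} cannot work. The definitions below are an assumption made for illustration, modeled on the classic stack-chained scheme; they are not copied from the XEmacs headers, and @code{Toy_Object} and @code{count_protected} are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef void *Toy_Object;       /* stand-in for Lisp_Object */

/* One record per protected variable, chained through the C stack. */
struct gcpro
{
  struct gcpro *next;           /* record registered before this one */
  Toy_Object *var;              /* address of the protected variable */
};

/* Head of the chain a collector would walk to find live variables. */
static struct gcpro *gcprolist;

#define GCPRO1(v) do {                                  \
  gcpro1.next = gcprolist; gcpro1.var = &(v);           \
  gcprolist = &gcpro1;                                  \
} while (0)

#define GCPRO2(v1, v2) do {                             \
  gcpro1.next = gcprolist; gcpro1.var = &(v1);          \
  gcpro2.next = &gcpro1;   gcpro2.var = &(v2);          \
  gcprolist = &gcpro2;                                  \
} while (0)

/* Pop every record this function pushed in one step. */
#define UNGCPRO (gcprolist = gcpro1.next)

/* Count the records currently on the chain. */
static int
count_protected (void)
{
  const struct gcpro *p;
  int n = 0;

  for (p = gcprolist; p != NULL; p = p->next)
    n++;
  return n;
}

/* Protect two locals, report how many the collector would see. */
static int
demo_gcpro2 (void)
{
  Toy_Object a = NULL, b = NULL;
  struct gcpro gcpro1, gcpro2;  /* used implicitly by the macros */
  int seen;

  GCPRO2 (a, b);
  seen = count_protected ();
  UNGCPRO;
  return seen;
}
```

In this model a second @code{GCPRO1} in the same function would overwrite the single @code{gcpro1} record: the first variable would lose its protection and @code{gcpro1.next} would point at @code{gcpro1} itself, leaving the chain circular.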
3263 | |
3264 @cindex caller-protects (@code{GCPRO} rule) | |
3265 Note also that the general rule is @dfn{caller-protects}; i.e. you are | |
3266 only responsible for protecting those Lisp objects that you create. Any | |
3267 objects passed to you as arguments should have been protected by whoever | |
3268 created them, so you don't in general have to protect them. | |
3269 | |
3270 In particular, the arguments to any Lisp primitive are always | |
3271 automatically @code{GCPRO}ed, when called ``normally'' from Lisp code or | |
3272 bytecode. So only a few Lisp primitives that are called frequently from | |
3273 C code, such as @code{Fprogn}, protect their arguments as a service to | |
3274 their caller. You don't need to protect your arguments when writing a | |
3275 new @code{DEFUN}. | |
3276 | |
3277 @code{GCPRO}ing is perhaps the trickiest and most error-prone part of | |
3278 XEmacs coding. It is @strong{extremely} important that you get this | |
3279 right and use a great deal of discipline when writing this code. | |
3280 @xref{GCPROing, ,@code{GCPRO}ing}, for full details on how to do this. | |
3281 | |
3282 What @code{DEFUN} actually does is declare a global structure of type | |
3283 @code{Lisp_Subr} whose name begins with capital @samp{SF} and which | |
3284 contains information about the primitive (e.g. a pointer to the | |
3285 function, its minimum and maximum allowed arguments, a string describing | |
3286 its Lisp name); @code{DEFUN} then begins a normal C function declaration | |
3287 using the @code{F...} name. The Lisp subr object that is the function | |
3288 definition of a primitive (i.e. the object in the function slot of the | |
3289 symbol that names the primitive) actually points to this @samp{SF} | |
3290 structure; when @code{Feval} encounters a subr, it looks in the | |
3291 structure to find out how to call the C function. | |
3292 | |
3293 Defining the C function is not enough to make a Lisp primitive | |
3294 available; you must also create the Lisp symbol for the primitive (the | |
3295 symbol is @dfn{interned}; @pxref{Obarrays}) and store a suitable subr | |
3296 object in its function cell. (If you don't do this, the primitive won't | |
3297 be seen by Lisp code.) The code looks like this: | |
3298 | |
3299 @example | |
3300 DEFSUBR (@var{fname}); | |
3301 @end example | |
3302 | |
3303 @noindent | |
3304 Here @var{fname} is the same name you used as the second argument to | |
3305 @code{DEFUN}. | |
3306 | |
3307 This call to @code{DEFSUBR} should go in the @code{syms_of_*()} function | |
3308 at the end of the module. If no such function exists, create it and | |
3309 make sure to also declare it in @file{symsinit.h} and call it from the | |
3310 appropriate spot in @code{main()}. @xref{General Coding Rules}. | |
3311 | |
3312 Note that C code cannot call functions by name unless they are defined | |
3313 in C. The way to call a function written in Lisp from C is to use | |
3314 @code{Ffuncall}, which embodies the Lisp function @code{funcall}. Since | |
3315 the Lisp function @code{funcall} accepts an unlimited number of | |
3316 arguments, in C it takes two: the number of Lisp-level arguments, and a | |
3317 one-dimensional array containing their values. The first Lisp-level | |
3318 argument is the Lisp function to call, and the rest are the arguments to | |
3319 pass to it. Since @code{Ffuncall} can call the evaluator, you must | |
3320 protect pointers from garbage collection around the call to | |
3321 @code{Ffuncall}. (However, @code{Ffuncall} explicitly protects all of | |
3322 its parameters, so you don't have to protect any pointers passed as | |
3323 parameters to it.) | |
3324 | |
3325 The C functions @code{call0}, @code{call1}, @code{call2}, and so on, | |
3326 provide handy ways to call a Lisp function conveniently with a fixed | |
3327 number of arguments. They work by calling @code{Ffuncall}. | |
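The calling convention described above (an argument count plus an array whose first element is the function) can be sketched in plain C. This is a toy model, not XEmacs code: @code{toy_value}, @code{toy_funcall}, @code{toy_call2}, @code{toy_sum} and @code{toy_add} are hypothetical names, and the sketch leaves out garbage-collection protection and error handling.

```c
#include <assert.h>

typedef union toy_value toy_value;
typedef toy_value (*toy_fn) (int nargs, toy_value *args);

/* Stand-in for a Lisp object: either a number or a callable. */
union toy_value
{
  long num;
  toy_fn fn;
};

/* args[0] is the function to call; args[1..nargs-1] are its arguments. */
static toy_value
toy_funcall (int nargs, toy_value *args)
{
  return args[0].fn (nargs - 1, args + 1);
}

/* Fixed-arity convenience wrapper: build the array, then delegate. */
static toy_value
toy_call2 (toy_fn fn, toy_value a, toy_value b)
{
  toy_value args[3];
  args[0].fn = fn;
  args[1] = a;
  args[2] = b;
  return toy_funcall (3, args);
}

/* A callable that accepts any number of numeric arguments. */
static toy_value
toy_sum (int nargs, toy_value *args)
{
  toy_value result;
  int i;
  result.num = 0;
  for (i = 0; i < nargs; i++)
    result.num += args[i].num;
  return result;
}

/* Example use of the wrapper. */
static long
toy_add (long x, long y)
{
  toy_value a, b;
  a.num = x;
  b.num = y;
  return toy_call2 (toy_sum, a, b).num;
}
```

The wrapper exists only so that callers with a known number of arguments do not have to build the array by hand.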
3328 | |
3329 @file{eval.c} is a very good file to look through for examples; | |
3330 @file{lisp.h} contains the definitions for important macros and | |
3331 functions. | |
3332 | |
3333 @node Writing Good Comments, Adding Global Lisp Variables, Writing Lisp Primitives, Rules When Writing New C Code | |
3334 @section Writing Good Comments | |
3335 @cindex writing good comments | |
3336 @cindex comments, writing good | |
3337 | |
3338 Comments are a lifeline for programmers trying to understand tricky | |
3339 code. In general, the less obvious it is what you are doing, the more | |
3340 you need a comment, and the more detailed it needs to be. You should | |
3341 always be on guard when you're writing code for stuff that's tricky, and | |
3342 should constantly be putting yourself in someone else's shoes and asking | |
3343 if that person could figure out without much difficulty what's going | |
3344 on. (Assume they are a competent programmer who understands the | |
3345 essentials of how the XEmacs code is structured but doesn't know much | |
3346 about the module you're working on or any algorithms you're using.) If | |
3347 you're not sure whether they would be able to, add a comment. Always | |
3348 err on the side of more comments, rather than fewer. | |
3349 | |
3350 Generally, when making comments, there is no need to attribute them with | |
3351 your name or initials. This especially goes for small, | |
3352 easy-to-understand, non-opinionated ones. Also, comments indicating | |
3353 where, when, and by whom a file was changed are @emph{strongly} | |
3354 discouraged, and in general will be removed as they are discovered. | |
3355 This is exactly what @file{ChangeLogs} are there for. However, it can | |
3356 occasionally be useful to mark exactly where (but not when or by whom) | |
3357 changes are made, particularly when making small changes to a file | |
3358 imported from elsewhere. These marks help when later on a newer version | |
3359 of the file is imported and the changes need to be merged. (If | |
3360 everything were always kept in CVS, there would be no need for this. | |
3361 But in practice, this often doesn't happen, or the CVS repository is | |
3362 later on lost or unavailable to the person doing the update.) | |
3363 | |
3364 When putting in an explicit opinion in a comment, you should | |
3365 @emph{always} attribute it with your name and the date. This also goes | |
3366 for long, complex comments explaining in detail the workings of | |
3367 something -- by putting your name there, you make it possible for | |
3368 someone who has questions about how that thing works to determine who | |
3369 wrote the comment so they can write to them. Use your actual name or | |
3370 your alias at xemacs.org, and not your initials or nickname, unless that | |
3371 is generally recognized (e.g. @samp{jwz}). Even then, please consider | |
3372 requesting a virtual user at xemacs.org (forwarding address; we can't | |
3373 provide an actual mailbox). Otherwise, give first and last name. If | |
3374 you're not a regular contributor, you might consider putting your email | |
3375 address in -- it may be in the ChangeLog, but after a while ChangeLogs | |
3376 have a tendency of disappearing or getting muddled. (E.g. your comment | |
3377 may get copied somewhere else or even into another program, and tracking | |
3378 down the proper ChangeLog may be very difficult.) | |
3379 | |
3380 If you come across an opinion that is not or is no longer valid, or you | |
3381 come across any comment that no longer applies but you want to keep it | |
3382 around, enclose it in @samp{[[ } and @samp{ ]]} marks and add a comment | |
3383 afterwards explaining why the preceding comment is no longer valid. Put | |
3384 your name on this comment, as explained above. | |
3385 | |
3386 Just as comments are a lifeline to programmers, incorrect comments are | |
3387 death. If you come across an incorrect comment, @strong{immediately} | |
3388 correct it or flag it as incorrect, as described in the previous | |
3389 paragraph. Whenever you work on a section of code, @emph{always} make | |
3390 sure to update any comments to be correct -- or, at the very least, flag | |
3391 them as incorrect. | |
3392 | |
3393 To indicate a ``todo'' or other problem, use four pound signs -- | |
3394 i.e. @samp{####}. | |
3395 | |
3396 @node Adding Global Lisp Variables, Writing Macros, Writing Good Comments, Rules When Writing New C Code | |
3397 @section Adding Global Lisp Variables | |
3398 @cindex global Lisp variables, adding | |
3399 @cindex variables, adding global Lisp | |
3400 | |
3401 Global variables whose names begin with @samp{Q} are constants whose | |
3402 value is a symbol of a particular name. The name of the variable should | |
3403 be derived from the name of the symbol using the same rules as for Lisp | |
3404 primitives. These variables are initialized using a call to | |
3405 @code{defsymbol()} in the @code{syms_of_*()} function. (This call | |
3406 interns a symbol, sets the C variable to the resulting Lisp object, and | |
3407 calls @code{staticpro()} on the C variable to tell the | |
3408 garbage-collection mechanism about this variable. What | |
3409 @code{staticpro()} does is add a pointer to the variable to a large | |
3410 global array; when garbage-collection happens, all pointers listed in | |
3411 the array are used as starting points for marking Lisp objects. This is | |
3412 important because it's quite possible that the only current reference to | |
3413 the object is the C variable. In the case of symbols, the | |
3414 @code{staticpro()} doesn't matter all that much because the symbol is | |
3415 contained in @code{obarray}, which is itself @code{staticpro()}ed. | |
3416 However, it's possible that a naughty user could do something like | |
3417 uninterning the symbol out of @code{obarray} or even setting | |
3418 @code{obarray} to a different value [although this is likely to make | |
3419 XEmacs crash!].) | |
3420 | |
3421 @strong{Please note:} It is potentially deadly if you declare a | |
3422 @samp{Q...} variable in two different modules. The two calls to | |
3423 @code{defsymbol()} are no problem, but some linkers will complain about | |
3424 multiply-defined symbols. The most insidious aspect of this is that | |
3425 often the link will succeed anyway, but then the resulting executable | |
3426 will sometimes crash in obscure ways during certain operations! | |
3427 | |
3428 To avoid this problem, declare any symbols with common names (such as | |
3429 @code{text}) that are not obviously associated with this particular | |
3430 module in the file @file{general-slots.h}. The ``-slots'' suffix | |
3431 indicates that this is a file that is included multiple times in | |
3432 @file{general.c}. Redefinition of preprocessor macros allows the | |
3433 effects to be different in each context, so this is actually more | |
3434 convenient and less error-prone than doing it in your module. | |
3435 | |
3436 Global variables whose names begin with @samp{V} are variables that | |
3437 contain Lisp objects. The convention here is that all global variables | |
3438 of type @code{Lisp_Object} begin with @samp{V}, and all others don't | |
3439 (including integer and boolean variables that have Lisp | |
3440 equivalents). Most of the time, these variables have equivalents in | |
3441 Lisp, but some don't. Those that do are declared this way by a call to | |
3442 @code{DEFVAR_LISP()} in the @code{vars_of_*()} initializer for the | |
3443 module. What this does is create a special @dfn{symbol-value-forward} | |
3444 Lisp object that contains a pointer to the C variable, intern a symbol | |
3445 whose name is as specified in the call to @code{DEFVAR_LISP()}, and set | |
3446 its value to the symbol-value-forward Lisp object; it also calls | |
3447 @code{staticpro()} on the C variable to tell the garbage-collection | |
3448 mechanism about the variable. When @code{eval} (or actually | |
3449 @code{symbol-value}) encounters this special object in the process of | |
3450 retrieving a variable's value, it follows the indirection to the C | |
3451 variable and gets its value. @code{setq} does similar things so that | |
3452 the C variable gets changed. | |
3453 | |
3454 Whether or not you @code{DEFVAR_LISP()} a variable, you need to | |
3455 initialize it in the @code{vars_of_*()} function; otherwise it will end | |
3456 up as all zeroes, which is the integer 0 (@emph{not} @code{nil}), and | |
3457 this is probably not what you want. Also, if the variable is not | |
3458 @code{DEFVAR_LISP()}ed, @strong{you must call} @code{staticpro()} on the | |
3459 C variable in the @code{vars_of_*()} function. Otherwise, the | |
3460 garbage-collection mechanism won't know that the object in this variable | |
3461 is in use, and will happily collect it and reuse its storage for another | |
3462 Lisp object, and you will be the one who's unhappy when you can't figure | |
3463 out how your variable got overwritten. | |
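As a rough illustration of the root-registration idea, here is a toy model in plain C. All names (@code{toy_staticpro}, @code{toy_mark_roots}, @code{Vtoy_registered}, @code{Vtoy_forgotten}, and so on) are hypothetical, and the real collector is far more involved; the sketch only shows that an object is marked when the @emph{variable} holding it has been registered.

```c
#include <assert.h>
#include <stddef.h>

struct toy_obj { int marked; };
typedef struct toy_obj *toy_ref;      /* stand-in for Lisp_Object */

#define TOY_MAX_ROOTS 16
static toy_ref *toy_roots[TOY_MAX_ROOTS];  /* addresses of root variables */
static int toy_nroots;

/* Register the address of a C variable as a marking root. */
static void
toy_staticpro (toy_ref *var)
{
  assert (toy_nroots < TOY_MAX_ROOTS);
  toy_roots[toy_nroots++] = var;
}

/* Mark whatever each registered variable points to right now. */
static void
toy_mark_roots (void)
{
  int i;
  for (i = 0; i < toy_nroots; i++)
    if (*toy_roots[i] != NULL)
      (*toy_roots[i])->marked = 1;
}

static struct toy_obj obj_a, obj_b;
static toy_ref Vtoy_registered;
static toy_ref Vtoy_forgotten;

/* Initialize both variables, but register only one of them. */
static void
toy_vars_of_example (void)
{
  Vtoy_registered = &obj_a;
  Vtoy_forgotten = &obj_b;
  toy_staticpro (&Vtoy_registered);
}
```

After @code{toy_vars_of_example} and @code{toy_mark_roots} run, only @code{obj_a} is marked; in a real collector the unmarked object would be eligible for reuse. Because the registry stores the address of the variable rather than its value, a later assignment to the variable is picked up by the next marking pass.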
3464 | |
3465 @node Writing Macros, Proper Use of Unsigned Types, Adding Global Lisp Variables, Rules When Writing New C Code | |
3466 @section Writing Macros | |
3467 @cindex writing macros | |
3468 @cindex macros, writing | |
3469 | |
3470 The three golden rules of macros: | |
3471 | |
3472 @enumerate | |
3473 @item | |
3474 Anything that's an lvalue can be evaluated more than once. | |
3475 @item | |
3476 Macros where anything else can be evaluated more than once should | |
3477 have the word "unsafe" in their name (exceptions may be made for | |
3478 large sets of macros that evaluate arguments of certain types more | |
3479 than once, e.g. struct buffer * arguments, when clearly indicated in | |
3480 the macro documentation). These macros are generally meant to be | |
3481 called only by other macros that have already stored the calling | |
3482 values in temporary variables. | |
3483 @item | |
3484 Nothing else can be evaluated more than once. Use inline | |
3485 functions, if necessary, to prevent multiple evaluation. | |
3486 @end enumerate | |
3487 | |
3488 NOTE: The functions and macros below are given full prototypes in their | |
3489 docs, even when the implementation is a macro. In such cases, passing | |
3490 an argument of a type other than expected will produce undefined | |
3491 results. Also, given that macros can do things functions can't (in | |
3492 particular, directly modify arguments as if they were passed by | |
3493 reference), the declaration syntax has been extended to include the | |
3494 call-by-reference syntax from C++, where an & after a type indicates | |
3495 that the argument is an lvalue and is passed by reference, i.e. the | |
3496 function can modify its value. (This is equivalent in C to passing a | |
3497 pointer to the argument, but without the need to explicitly worry about | |
3498 pointers.) | |
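To make the by-reference notation concrete, here is a small sketch. A macro documented as @code{void TOY_INC_BY (int &var, int n)} would modify @code{var} in place, which a C function can only do through an explicit pointer. Both @code{TOY_INC_BY} and @code{toy_inc_by} are hypothetical names invented for this example.

```c
#include <assert.h>

/* Documented as: void TOY_INC_BY (int &var, int n);
   VAR must be an lvalue and is modified in place. */
#define TOY_INC_BY(var, n) ((void) ((var) += (n)))

/* The plain C function equivalent needs an explicit pointer. */
static void
toy_inc_by (int *var, int n)
{
  *var += n;
}

/* Apply both forms to the same variable and return the result. */
static int
toy_demo (void)
{
  int count = 1;
  TOY_INC_BY (count, 2);     /* caller writes the variable itself */
  toy_inc_by (&count, 3);    /* caller must take the address */
  return count;
}
```

The macro is capitalized because it does something a C function obviously cannot do with the same call syntax.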
3499 | |
3500 When to capitalize macros: | |
3501 | |
3502 @itemize @bullet | |
3503 @item | |
3504 Capitalize macros doing stuff obviously impossible with (C) | |
3505 functions, e.g. directly modifying arguments as if they were passed by | |
3506 reference. | |
3507 @item | |
3508 Capitalize macros that evaluate @strong{any} argument more than once regardless | |
3509 of whether that's "allowed" (e.g. buffer arguments). | |
3510 @item | |
3511 Capitalize macros that directly access a field in a Lisp_Object or | |
3512 its equivalent underlying structure. In such cases, access through the | |
3513 Lisp_Object precedes the macro with an X, and access through the underlying | |
3514 structure doesn't. | |
3515 @item | |
3516 Capitalize certain other basic macros relating to Lisp_Objects; e.g. | |
3517 FRAMEP, CHECK_FRAME, etc. | |
3518 @item | |
3519 Try to avoid capitalizing any other macros. | |
3520 @end itemize | |
3521 | |
3522 @node Proper Use of Unsigned Types, Techniques for XEmacs Developers, Writing Macros, Rules When Writing New C Code | |
3523 @section Proper Use of Unsigned Types | |
3524 @cindex unsigned types, proper use of | |
3525 @cindex types, proper use of unsigned | |
3526 | |
3527 Avoid using @code{unsigned int} and @code{unsigned long} whenever | |
3528 possible. Unsigned types are viral -- any arithmetic or comparisons | |
3529 involving mixed signed and unsigned types are automatically converted to | |
3530 unsigned, which is almost certainly not what you want. Many subtle and | |
3531 hard-to-find bugs are created by careless use of unsigned types. In | |
3532 general, you should almost @emph{never} use an unsigned type to hold a | |
3533 regular quantity of any sort. The only exceptions are | |
3534 | |
3535 @enumerate | |
3536 @item | |
3537 When there's a reasonable possibility you will actually need all 32 or | |
3538 64 bits to store the quantity. | |
3539 @item | |
3540 When calling existing APIs that require unsigned types. In this case, | |
3541 you should still do all manipulation using signed types, and do the | |
3542 conversion at the very threshold of the API call. | |
3543 @item | |
3544 In existing code that you don't want to modify because you don't | |
3545 maintain it. | |
3546 @item | |
3547 In bit-field structures. | |
3548 @end enumerate | |
3549 | |
3550 Other reasonable uses of @code{unsigned int} and @code{unsigned long} | |
3551 are representing non-quantities -- e.g. bit-oriented flags and such. | |
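A short sketch of the kind of surprise meant here, using only standard C. The function names are made up for the example. When an @code{int} meets an @code{unsigned int} in a comparison, the signed operand is converted to unsigned first, so a negative value becomes a very large positive one. (Many compilers can warn about this, for example with gcc's @code{-Wsign-compare}.)

```c
#include <assert.h>

/* Mixed comparison: S is converted to unsigned before comparing,
   so -1 becomes UINT_MAX and the test is false. */
static int
toy_mixed_less (int s, unsigned int u)
{
  return s < u;
}

/* All-signed comparison: behaves the way the arithmetic reads. */
static int
toy_signed_less (int s, int limit)
{
  return s < limit;
}

/* "One before the start" of an empty range: with an unsigned length
   the subtraction wraps around instead of going negative. */
static int
toy_unsigned_wraps (unsigned int len)
{
  return len - 1 > len;
}
```

With these definitions, @code{toy_mixed_less (-1, 1u)} is 0 while @code{toy_signed_less (-1, 1)} is 1, and @code{toy_unsigned_wraps (0u)} is 1.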
3552 | |
3553 @node Techniques for XEmacs Developers, , Proper Use of Unsigned Types, Rules When Writing New C Code | |
3554 @section Techniques for XEmacs Developers | |
3555 @cindex techniques for XEmacs developers | |
3556 @cindex developers, techniques for XEmacs | |
3557 | |
3558 @cindex Purify | |
3559 @cindex Quantify | |
3560 To make a purified XEmacs, do: @code{make puremacs}. | |
3561 To make a quantified XEmacs, do: @code{make quantmacs}. | |
3562 | |
3563 You simply can't dump Quantified and Purified images (unless using the | |
3564 portable dumper). Purify gets confused when XEmacs frees memory in one | |
3565 process that was allocated in a @emph{different} process on a different | |
3566 machine! Run it like so: | |
3567 @example | |
3568 temacs -batch -l loadup.el run-temacs @var{xemacs-args...} | |
3569 @end example | |
3570 | |
3571 @cindex error checking | |
3572 Before you go through the trouble, are you compiling with all | |
3573 debugging and error-checking off? If not, try that first. Be warned | |
3574 that while Quantify is directly responsible for quite a few | |
3575 optimizations which have been made to XEmacs, doing a run which | |
3576 generates results which can be acted upon is not necessarily a trivial | |
3577 task. | |
3578 | |
3579 Also, if you're still willing to do some runs make sure you configure | |
3580 with the @samp{--quantify} flag. That will keep Quantify from starting | |
3581 to record data until after the loadup is completed and will shut off | |
3582 recording right before it shuts down (which generates enough bogus data | |
3583 to throw most results off). It also enables three additional elisp | |
3584 commands: @code{quantify-start-recording-data}, | |
3585 @code{quantify-stop-recording-data} and @code{quantify-clear-data}. | |
3586 | |
3587 If you want to make XEmacs faster, target your favorite slow benchmark, | |
3588 run a profiler like Quantify, @code{gprof}, or @code{tcov}, and figure | |
3589 out where the cycles are going. In many cases you can localize the | |
3590 problem (because a particular new feature or even a single patch | |
3591 elicited it). Don't hesitate to use brute force techniques like a | |
3592 global counter incremented at strategic places, especially in | |
3593 combination with other performance indications (@emph{e.g.}, degree of | |
3594 buffer fragmentation into extents). | |
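The brute-force counter technique can be as simple as the following sketch. The names @code{toy_scan}, @code{toy_scan_calls} and @code{toy_workload} are hypothetical; in practice the counter would be placed in whichever low-level routine is under suspicion and removed again afterwards.

```c
#include <assert.h>
#include <stdio.h>

static long toy_scan_calls;   /* global counter, bumped on every call */

/* The suspect low-level routine: count occurrences of TARGET. */
static int
toy_scan (const char *text, char target)
{
  int hits = 0;
  toy_scan_calls++;           /* the instrumentation itself */
  for (; *text != '\0'; text++)
    if (*text == target)
      hits++;
  return hits;
}

/* Run the operation being investigated and report the count. */
static long
toy_workload (int iterations)
{
  int i;
  toy_scan_calls = 0;
  for (i = 0; i < iterations; i++)
    (void) toy_scan ("some buffer text", 'e');
  fprintf (stderr, "toy_scan called %ld times\n", toy_scan_calls);
  return toy_scan_calls;
}
```

A count that grows much faster than the size of the input is a strong hint that an algorithm is quadratic or worse, even before a profiler is involved.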
3595 | |
3596 Specific projects: | |
3597 | |
3598 @itemize @bullet | |
3599 @item | |
3600 Make the garbage collector faster. Figure out how to write an | |
3601 incremental garbage collector. | |
3602 @item | |
3603 Write a compiler that takes bytecode and spits out C code. | |
3604 Unfortunately, you will then need a C compiler and a more fully | |
3605 developed module system. | |
3606 @item | |
3607 Speed up redisplay. | |
3608 @item | |
3609 Speed up syntax highlighting. It was suggested that ``maybe moving some | |
3610 of the syntax highlighting capabilities into C would make a | |
3611 difference.'' Wrong idea, I think. When processing one 400kB file a | |
3612 particular low-level routine was being called 40 @emph{million} times | |
3613 simply for @emph{one} call to @code{newline-and-indent}. Syntax | |
3614 highlighting needs to be rewritten to use a reliable, fast parser, then | |
3615 to trust the pre-parsed structure, and only do re-highlighting locally | |
3616 to a text change. Modern machines are fast enough to implement such | |
3617 parsers in Lisp; but no machine will ever be fast enough to deal with | |
3618 quadratic (or worse) algorithms! | |
3619 @item | |
3620 Implement tail recursion in Emacs Lisp (hard!). | |
3621 @end itemize | |
3622 | |
3623 Unfortunately, Emacs Lisp is slow, and is going to stay slow. Function | |
3624 calls in elisp are especially expensive. Iterating over a long list is | |
3625 going to be 30 times faster implemented in C than in Elisp. | |
3626 | |
3627 Heavily used small code fragments need to be fast. The traditional way | |
3628 to implement such code fragments in C is with macros. But macros in C | |
3629 are known to be broken. | |
3630 | |
3631 @cindex macro hygiene | |
3632 Macro arguments that are repeatedly evaluated may suffer from repeated | |
3633 side effects or suboptimal performance. | |
3634 | |
3635 Variable names used in macros may collide with caller's variables, | |
3636 causing (at least) unwanted compiler warnings. | |
3637 | |
3638 In order to solve these problems, and maintain statement semantics, one | |
3639 should use the @code{do @{ ... @} while (0)} trick while trying to | |
3640 reference macro arguments exactly once using local variables. | |
3641 | |
3642 Let's take a look at this poor macro definition: | |
3643 | |
3644 @example | |
3645 #define MARK_OBJECT(obj) \ | |
3646 if (!marked_p (obj)) mark_object (obj), did_mark = 1 | |
3647 @end example | |
3648 | |
3649 This macro evaluates its argument twice, and also fails if used like this: | |
3650 @example | |
3651 if (flag) MARK_OBJECT (obj); else do_something (); | |
3652 @end example | |
3653 | |
3654 A much better definition is | |
3655 | |
3656 @example | |
3657 #define MARK_OBJECT(obj) do @{ \ | |
3658 Lisp_Object mo_obj = (obj); \ | |
3659 if (!marked_p (mo_obj)) \ | |
3660 @{ \ | |
3661 mark_object (mo_obj); \ | |
3662 did_mark = 1; \ | |
3663 @} \ | |
3664 @} while (0) | |
3665 @end example | |
3666 | |
3667 Notice the elimination of double evaluation by using the local variable | |
3668 with the obscure name. Writing safe and efficient macros requires great | |
3669 care. The one problem with macros that cannot be portably worked around | |
3670 is that, since a C block has no value, a macro used as an expression rather | |
3671 than a statement cannot use the techniques just described to avoid | |
3672 multiple evaluation. | |
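The expression case can be illustrated with a sketch in standard C (C99 for @code{inline}). The macro below must evaluate one argument twice because it is used as an expression, whereas the inline function evaluates each argument exactly once. The names @code{TOY_MAX_UNSAFE}, @code{toy_max} and @code{toy_next} are made up for this example.

```c
#include <assert.h>

/* Expression macro: the larger argument is evaluated twice. */
#define TOY_MAX_UNSAFE(a, b) ((a) > (b) ? (a) : (b))

/* Inline function: each argument is evaluated once, at the call. */
static inline int
toy_max (int a, int b)
{
  return a > b ? a : b;
}

static int toy_calls;

/* An argument expression with a visible side effect. */
static int
toy_next (void)
{
  return ++toy_calls;
}

/* How many times does the macro evaluate its first argument? */
static int
toy_evals_with_macro (void)
{
  toy_calls = 0;
  (void) TOY_MAX_UNSAFE (toy_next (), 0);
  return toy_calls;
}

/* How many times does the inline function evaluate it? */
static int
toy_evals_with_inline (void)
{
  toy_calls = 0;
  (void) toy_max (toy_next (), 0);
  return toy_calls;
}
```

Here the macro version calls @code{toy_next} twice and the inline version calls it once, which is why the macro carries ``unsafe'' in its name.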
3673 | |
3674 @cindex inline functions | |
3675 In most cases where a macro has function semantics, an inline function | |
3676 is a better implementation technique. Modern compiler optimizers tend | |
3677 to inline functions even if they have no @code{inline} keyword, and | |
3678 configure magic ensures that the @code{inline} keyword can be safely | |
3679 used as an additional compiler hint. Inline functions used in a single | |
3680 @file{.c} file are easy. The function must already be defined to be | |
3681 @code{static}. Just add another @code{inline} keyword to the | |
3682 definition. | |
3683 | |
3684 @example | |
3685 inline static int | |
3686 heavily_used_small_function (int arg) | |
3687 @{ | |
3688 ... | |
3689 @} | |
3690 @end example | |
3691 | |
3692 Inline functions in header files are trickier, because we would like to | |
3693 make the following optimization if the function is @emph{not} inlined | |
3694 (for example, because we're compiling for debugging). We would like the | |
3695 function to be defined externally exactly once, and each calling | |
3696 translation unit would create an external reference to the function, | |
3697 instead of including a definition of the inline function in the object | |
3698 code of every translation unit that uses it. This optimization is | |
3699 currently only available for gcc. But you don't have to worry about the | |
3700 trickiness; just define your inline functions in header files using this | |
3701 pattern: | |
3702 | |
3703 @example | |
3704 INLINE_HEADER int | |
3705 i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg); | |
3706 INLINE_HEADER int | |
3707 i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg) | |
3708 @{ | |
3709 ... | |
3710 @} | |
3711 @end example | |
3712 | |
3713 The declaration right before the definition is to prevent warnings when | |
3714 compiling with @code{gcc -Wmissing-declarations}. I consider issuing | |
3715 this warning for inline functions a gcc bug, but the gcc maintainers disagree. | |
3716 | |
3717 @cindex inline functions, headers | |
3718 @cindex header files, inline functions | |
3719 Every header which contains inline functions, either directly by using | |
3720 @code{INLINE_HEADER} or indirectly by using @code{DECLARE_LRECORD} must | |
3721 be added to @file{inline.c}'s includes to make the optimization | |
3722 described above work. (Optimization note: if all @code{INLINE_HEADER} | |
3723 functions are in fact inlined in all translation units, then the linker | |
3724 can just discard @code{inline.o}, since it contains only unreferenced code). | |
3725 | |
3726 To get started debugging XEmacs, take a look at the @file{.gdbinit} and | |
3727 @file{.dbxrc} files in the @file{src} directory. See the section in the | |
3728 XEmacs FAQ on How to Debug an XEmacs problem with a debugger. | |
3729 | |
3730 After making source code changes, run @code{make check} to ensure that | |
3731 you haven't introduced any regressions. If you want to make XEmacs more | |
3732 reliable, please improve the test suite in @file{tests/automated}. | |
3733 | |
3734 Did you make sure you didn't introduce any new compiler warnings? | |
3735 | |
3736 Before submitting a patch, please try compiling at least once with | |
3737 | |
3738 @example | |
3739 configure --with-mule --use-union-type --error-checking=all | |
3740 @end example | |
3741 | |
3742 Here are things to know when you create a new source file: | |
3743 | |
3744 @itemize @bullet | |
3745 @item | |
3746 All @file{.c} files should @code{#include <config.h>} first. Almost all | |
3747 @file{.c} files should @code{#include "lisp.h"} second. | |
3748 | |
3749 @item | |
3750 Generated header files should be included using the @samp{#include <...>} | |
3751 syntax, not the @samp{#include "..."} syntax. The generated headers are: | |
3752 | |
3753 @file{config.h sheap-adjust.h paths.h Emacs.ad.h} | |
3754 | |
3755 The basic rule is that you should assume builds using @samp{--srcdir}, | |
3756 and the @samp{#include <...>} syntax needs to be used when the | |
3757 to-be-included generated file is in a potentially different directory | |
3758 @emph{at compile time}. The non-obvious C rule is that | |
3759 @samp{#include "..."} means to search for the included file in the same | |
3760 directory as the including file, @emph{not} in the current directory. | |
3761 Normally this is not a problem but when building with @samp{--srcdir}, | |
3762 @file{make} will search the @samp{VPATH} for you, while the C compiler | |
3763 knows nothing about it. | |
3764 | |
3765 @item | |
3766 Header files should @emph{not} include @samp{<config.h>} or | |
3767 @samp{"lisp.h"}. It is the responsibility of the @file{.c} files that | |
3768 use them to do so. | |
3769 | |
3770 @end itemize | |
3771 | |
3772 @cindex Lisp object types, creating | |
3773 @cindex creating Lisp object types | |
3774 @cindex object types, creating Lisp | |
3775 Here is a checklist of things to do when creating a new lisp object type | |
3776 named @var{foo}: | |
3777 | |
3778 @enumerate | |
3779 @item | |
3780 create @var{foo}.h | |
3781 @item | |
3782 create @var{foo}.c | |
3783 @item | |
3784 add definitions of @code{syms_of_@var{foo}}, etc. to @file{@var{foo}.c} | |
3785 @item | |
3786 add declarations of @code{syms_of_@var{foo}}, etc. to @file{symsinit.h} | |
3787 @item | |
3788 add calls to @code{syms_of_@var{foo}}, etc. to @file{emacs.c} | |
3789 @item | |
3790 add definitions of macros like @code{CHECK_@var{FOO}} and | |
3791 @code{@var{FOO}P} to @file{@var{foo}.h} | |
3792 @item | |
3793 add the new type index to @code{enum lrecord_type} | |
3794 @item | |
3795 add a @code{DEFINE_LRECORD_IMPLEMENTATION} call to @file{@var{foo}.c} | |
3796 @item | |
3797 add an @code{INIT_LRECORD_IMPLEMENTATION} call to @code{syms_of_@var{foo}} in @file{@var{foo}.c} | |
3798 @end enumerate | |
3799 | |
3800 @node Regression Testing XEmacs, CVS Techniques, Rules When Writing New C Code, Top | |
3801 @chapter Regression Testing XEmacs | |
3802 @cindex testing, regression | |
3803 | |
3804 @menu | |
3805 * How to Regression-Test:: | |
3806 * Modules for Regression Testing:: | |
3807 @end menu | |
3808 | |
3809 @node How to Regression-Test, Modules for Regression Testing, Regression Testing XEmacs, Regression Testing XEmacs | |
3810 @section How to Regression-Test | |
3811 @cindex how to regression-test | |
3812 @cindex regression-test, how to | |
3813 @cindex testing, regression, how to | |
3814 | |
3815 The source directory @file{tests/automated} contains XEmacs' automated | |
3816 test suite. The usual way of running all the tests is running | |
3817 @code{make check} from the top-level build directory. | |
3818 | |
3819 The test suite is unfinished and it's still lacking some essential | |
3820 features. It is nevertheless recommended that you run the tests to | |
3821 confirm that XEmacs behaves correctly. | |
3822 | |
3823 If you want to run a specific test case, you can do it from the | |
3824 command-line like this: | |
3825 | |
3826 @example | |
3827 $ xemacs -batch -l test-harness.elc -f batch-test-emacs TEST-FILE | |
3828 @end example | |
3829 | |
3830 If a test fails and you need more information, you can run the test | |
3831 suite interactively by loading @file{test-harness.el} into a running | |
3832 XEmacs and typing @kbd{M-x test-emacs-test-file RET <filename> RET}. | |
3833 You will see a log of passed and failed tests, which should allow you to | |
3834 investigate the source of the error and ultimately fix the bug. If you | |
3835 are not capable of, or don't have time for, debugging it yourself, | |
3836 please do report the failures using @kbd{M-x report-emacs-bug} or | |
3837 @kbd{M-x build-report}. | |
3838 | |
3839 @deffn Command test-emacs-test-file file | |
3840 Runs the tests in @var{file}. @file{test-harness.el} must be loaded. | |
3841 Defines all the macros described in this node, and undefines them when | |
3842 done. | |
3843 @end deffn | |
3844 | |
3845 Adding a new test file is trivial: just create a new file in | |
3846 @file{tests/automated} and it will be run. There is no need to | |
3847 byte-compile any of the files in this directory---the test-harness will | |
3848 take care of any necessary byte-compilation. | |
3849 | |
3850 Look at the existing test cases for examples of how to write tests. | |
3851 It all boils down to your imagination and judicious use of the macros | |
3852 @code{Assert}, @code{Check-Error}, @code{Check-Error-Message}, and | |
3853 @code{Check-Message}. Note that all of these macros are defined only | |
3854 for the duration of the test: they do not exist in the global | |
3855 environment. | |
3856 | |
3857 @deffn Macro Assert expr | |
3858 Check that @var{expr} is non-nil at this point in the test. | |
3859 @end deffn | |
3860 | |
3861 @deffn Macro Check-Error expected-error body | |
3862 Check that execution of @var{body} causes @var{expected-error} to be | |
3863 signaled. @var{body} is a @code{progn}-like body, and may contain | |
3864 several expressions. @var{expected-error} is a symbol defined as | |
3865 an error by @code{define-error}. | |
3866 @end deffn | |
3867 | |
3868 @deffn Macro Check-Error-Message expected-error expected-error-regexp body | |
3869 Check that execution of @var{body} causes @var{expected-error} to be | |
3870 signaled, and generate a message matching @var{expected-error-regexp}. | |
3871 @var{body} is a @code{progn}-like body, and may contain several | |
3872 expressions. @var{expected-error} is a symbol defined as an error | |
3873 by @code{define-error}. | |
3874 @end deffn | |
3875 | |
3876 @deffn Macro Check-Message expected-message body | |
3877 Check that execution of @var{body} causes @var{expected-message} to be | |
3878 generated (using @code{message} or a similar function). @var{body} is a | |
3879 @code{progn}-like body, and may contain several expressions. | |
3880 @end deffn | |
3881 | |
3882 Here's a simple example checking case-sensitive and case-insensitive | |
3883 comparisons from @file{case-tests.el}. | |
3884 | |
3885 @example | |
3886 (with-temp-buffer | |
3887 (insert "Test Buffer") | |
3888 (let ((case-fold-search t)) | |
3889 (goto-char (point-min)) | |
3890 (Assert (eq (search-forward "test buffer" nil t) 12)) | |
3891 (goto-char (point-min)) | |
3892 (Assert (eq (search-forward "Test buffer" nil t) 12)) | |
3893 (goto-char (point-min)) | |
3894 (Assert (eq (search-forward "Test Buffer" nil t) 12)) | |
3895 | |
3896 (setq case-fold-search nil) | |
3897 (goto-char (point-min)) | |
3898 (Assert (not (search-forward "test buffer" nil t))) | |
3899 (goto-char (point-min)) | |
3900 (Assert (not (search-forward "Test buffer" nil t))) | |
3901 (goto-char (point-min)) | |
3902 (Assert (eq (search-forward "Test Buffer" nil t) 12)))) | |
3903 @end example | |
3904 | |
3905 This example could be saved in a file in @file{tests/automated}, and it | |
3906 would constitute a complete test, automatically executed when you run | |
3907 @kbd{make check} after building XEmacs. More complex tests may require | |
3908 substantial temporary scaffolding to create the environment that elicits | |
3909 the bugs, but the top-level @file{Makefile} and @file{test-harness.el} | |
3910 handle the running and collection of results from the @code{Assert}, | |
3911 @code{Check-Error}, @code{Check-Error-Message}, and @code{Check-Message} | |
3912 macros. | |
3913 | |
3914 Don't suppress tests just because they're due to known bugs not yet | |
3915 fixed---use the @code{Known-Bug-Expect-Failure} wrapper macro to mark | |
3916 them. | |
3917 | |
3918 @deffn Macro Known-Bug-Expect-Failure body | |
3919 Arrange for failing tests in @var{body} to generate messages prefixed | |
3920 with "KNOWN BUG:" instead of "FAIL:". @var{body} is a @code{progn}-like | |
3921 body, and may contain several tests. | |
3922 @end deffn | |
3923 | |
3924 A lot of the tests we run push limits; suppress Ebola warning messages | |
3925 with the @code{Ignore-Ebola} wrapper macro. | |
3926 | |
3927 @deffn Macro Ignore-Ebola body | |
3928 Suppress Ebola warning messages while running tests in @var{body}. | |
3929 @var{body} is a @code{progn}-like body, and may contain several tests. | |
3930 @end deffn | |
3931 | |
3932 Both macros are defined temporarily within the test function. Simple | |
3933 examples: | |
3934 | |
3935 @example | |
3936 ;; Apparently Ignore-Ebola is a solution with no problem to address. | |
3937 ;; There are no examples in 21.5, anyway. | |
3938 | |
3939 ;; from regexp-tests.el | |
3940 (Known-Bug-Expect-Failure | |
3941 (Assert (not (string-match "\\b" ""))) | |
3942 (Assert (not (string-match " \\b" " ")))) | |
3943 @end example | |
3944 | |
3945 In general, you should avoid using functionality from packages in your | |
3946 tests, because you can't be sure that everyone will have the required | |
3947 package. However, if you've got a test that works, by all means add it. | |
3948 Simply wrap the test in an appropriate conditional, add a notice that the test | |
3949 was skipped, and update the @code{skipped-test-reasons} hashtable. The | |
3950 wrapper macro @code{Skip-Test-Unless} is provided to handle common | |
3951 cases. | |
3952 | |
3953 @defvar skipped-test-reasons | |
3954 Hash table counting the number of times a particular reason is given for | |
3955 skipping tests. This is only defined within @code{test-emacs-test-file}. | |
3956 @end defvar | |
3957 | |
3958 @deffn Macro Skip-Test-Unless prerequisite reason description body | |
3959 @var{prerequisite} is usually a feature test (@code{featurep}, | |
3960 @code{boundp}, @code{fboundp}). @var{reason} is a string describing the | |
3961 prerequisite; it must be unique because it is used as a hash key in a | |
3962 table of reasons for skipping tests. @var{description} describes the | |
3963 tests being skipped, for the test result summary. @var{body} is a | |
3964 @code{progn}-like body, and may contain several tests. | |
3965 @end deffn | |
3966 | |
3967 @code{Skip-Test-Unless} is defined temporarily within the test function. | |
3968 Here's an example of usage from @file{syntax-tests.el}: | |
3969 | |
3970 @example | |
3971 ;; Test forward-comment at buffer boundaries | |
3972 (with-temp-buffer | |
3973 ;; try to use exactly what you need: featurep, boundp, fboundp | |
3974 (Skip-Test-Unless (fboundp 'c-mode) | |
3975 "c-mode unavailable" | |
3976 "comment and parse-partial-sexp tests" | |
3977 ;; and here's the test code | |
3978 (c-mode) | |
3979 (insert "// comment\n") | |
3980 (forward-comment -2) | |
3981 (Assert (eq (point) (point-min))) | |
3982 (let ((point (point))) | |
3983 (insert "/* comment */") | |
3984 (goto-char point) | |
3985 (forward-comment 2) | |
3986 (Assert (eq (point) (point-max))) | |
3987 (parse-partial-sexp point (point-max))))) | |
3988 @end example | |
3989 | |
3990 @code{Skip-Test-Unless} is intended for use with features that are normally | |
3991 present in typical configurations. For truly optional features, or | |
3992 tests that apply to one of several alternative implementations (@emph{e.g.}, to | |
3993 GTK widgets, but not Athena, Motif, MS Windows, or Carbon), simply | |
3994 silently suppress the test if the feature is not available. | |
3995 | |
3996 Here are a few general hints for writing tests. | |
3997 | |
3998 @enumerate | |
3999 @item | |
4000 Include related successful cases. Fixes often break something. | |
4001 | |
4002 @item | |
4003 Use the @code{Known-Bug-Expect-Failure} macro to mark the cases you know | |
4004 are going to fail. We want to be able to distinguish between | |
4005 regressions and other unexpected failures, and cases that have | |
4006 been (partially) analyzed but not yet repaired. | |
4007 | |
4008 @item | |
4009 Mark the bug with the date of report. An ``Unfixed since yyyy-mm-dd'' | |
4010 gloss for @code{Known-Bug-Expect-Failure} is planned to further increase | |
4011 developer embarrassment (== incentive to fix the bug), but until then at | |
4012 least put a comment about the date so we can easily see when it was | |
4013 first reported. | |
4014 | |
4015 @item | |
4016 It's a matter of your judgement, but you should often use generic tests | |
4017 (@emph{e.g.}, @code{eq}) instead of more specific tests (@code{=} for | |
4018 numbers) even though you know that arguments ``should'' be of correct | |
4019 type. That is, if the functions used can return generic objects | |
4020 (typically @code{nil}), as well as some more specific type that will be | |
4021 returned on success. We don't want failures of those assertions | |
4022 reported as ``other failures'' (a wrong-type-arg signal, rather than a | |
4023 null return), we want them reported as ``assertion failures.'' | |
4024 | |
4025 One example is a test that tests @code{(= (string-match this that) 0)}, | |
4026 expecting a successful match. Now suppose @code{string-match} is broken | |
4027 such that the match fails. Then it will return @code{nil}, and @code{=} | |
4028 will signal ``wrong-type-argument, number-char-or-marker-p, nil'', | |
4029 generating an ``other failure'' in the report. But this should be | |
4030 reported as an assertion failure (the test failed in a foreseeable way), | |
4031 rather than something else (we don't know what happened because XEmacs | |
4032 is broken in a way that we weren't trying to test!) | |
4033 @end enumerate | |
4034 | |
4035 @node Modules for Regression Testing, , How to Regression-Test, Regression Testing XEmacs | |
4036 @section Modules for Regression Testing | |
4037 @cindex modules for regression testing | |
4038 @cindex regression testing, modules for | |
4039 | |
4040 @example | |
4041 @file{test-harness.el} | |
4042 @file{base64-tests.el} | |
4043 @file{byte-compiler-tests.el} | |
4044 @file{case-tests.el} | |
4045 @file{ccl-tests.el} | |
4046 @file{c-tests.el} | |
4047 @file{database-tests.el} | |
4048 @file{extent-tests.el} | |
4049 @file{hash-table-tests.el} | |
4050 @file{lisp-tests.el} | |
4051 @file{md5-tests.el} | |
4052 @file{mule-tests.el} | |
4053 @file{regexp-tests.el} | |
4054 @file{symbol-tests.el} | |
4055 @file{syntax-tests.el} | |
4056 @file{tag-tests.el} | |
4057 @file{weak-tests.el} | |
4058 @end example | |
4059 | |
4060 @file{test-harness.el} defines the macros @code{Assert}, | |
4061 @code{Check-Error}, @code{Check-Error-Message}, and | |
4062 @code{Check-Message}. The other files are test files, testing various | |
4063 XEmacs facilities. @xref{Regression Testing XEmacs}. | |
4064 | |
4065 | |
4066 @node CVS Techniques, The Modules of XEmacs, Regression Testing XEmacs, Top | |
4067 @chapter CVS Techniques | |
4068 @cindex CVS techniques | |
4069 | |
4070 @menu | |
4071 * Merging a Branch into the Trunk:: | |
4072 @end menu | |
4073 | |
4074 @node Merging a Branch into the Trunk, , CVS Techniques, CVS Techniques | |
4075 @section Merging a Branch into the Trunk | |
4076 @cindex merging a branch into the trunk | |
4077 | |
4078 @enumerate | |
4079 @item | |
4080 If you haven't already done a merge, you will be merging from the branch | |
4081 point; otherwise you'll be merging from the last merge point, which | |
4082 should be marked by a tag, e.g. @samp{last-sync-ben-mule-21-5}. In the | |
4083 former case, create the last-sync tag, e.g. | |
4084 | |
4085 @example | |
4086 crw rtag -r ben-mule-21-5-bp last-sync-ben-mule-21-5 xemacs | |
4087 @end example | |
4088 | |
4089 (You did create a branch point tag when you created the branch, didn't | |
4090 you?) | |
4091 | |
4092 @item | |
4093 Check everything in on your branch. | |
4094 | |
4095 @item | |
4096 Tag your branch with a pre-sync tag, e.g. | |
4097 | |
4098 @example | |
4099 crw rtag -r ben-mule-21-5 ben-mule-21-5-pre-feb-20-2002-sync xemacs | |
4100 @end example | |
4101 | |
4102 Note, you need to use rtag and specify a version with @samp{-r} (use | |
4103 @samp{-r HEAD} if necessary) so that removed files are handled correctly | |
4104 in some obscure cases. See section 4.8 of the CVS manual. | |
4105 | |
4106 @item | |
4107 Tag the trunk so you have a stable place to merge up to in case people | |
4108 are asynchronously committing to the trunk, e.g. | |
4109 | |
4110 @example | |
4111 crw rtag -r HEAD main-branch-ben-mule-21-5-syncpoint-feb-20-2002 xemacs | |
4112 crw rtag -F -r main-branch-ben-mule-21-5-syncpoint-feb-20-2002 next-sync-ben-mule-21-5 xemacs | |
4113 @end example | |
4114 | |
4115 Use -F in the second case because the name might already exist, e.g. if | |
4116 you've already done a merge. We make two tags because one is a | |
4117 permanent mark indicating a syncpoint when merging, and the other is a | |
4118 symbolic tag to make other operations easier. | |
4119 | |
4120 @item | |
4121 Make a backup of your source tree (not totally necessary but useful for | |
4122 reference and peace of mind): Move one level up from the top directory | |
4123 of your branch and do, e.g. | |
4124 | |
4125 @example | |
4126 cp -a mule mule-backup-2-23-02 | |
4127 @end example | |
4128 | |
4129 @item | |
4130 Now, we're ready to merge! Make sure you're in the top directory of | |
4131 your branch and do, e.g. | |
4132 | |
4133 @example | |
4134 cvs update -j last-sync-ben-mule-21-5 -j next-sync-ben-mule-21-5 | |
4135 @end example | |
4136 | |
4137 @item | |
4138 Fix all merge conflicts. Get the sucker to compile and run. | |
4139 | |
4140 @item | |
4141 Tag your branch with a post-sync tag, e.g. | |
4142 | |
4143 @example | |
4144 crw rtag -r ben-mule-21-5 ben-mule-21-5-post-feb-20-2002-sync xemacs | |
4145 @end example | |
4146 | |
4147 @item | |
4148 Update the last-sync tag, e.g. | |
4149 | |
4150 @example | |
4151 crw rtag -F -r next-sync-ben-mule-21-5 last-sync-ben-mule-21-5 xemacs | |
4152 @end example | |
4153 @end enumerate | |
4154 | |
4155 | |
4156 @node The Modules of XEmacs, Allocation of Objects in XEmacs Lisp, CVS Techniques, Top | |
4157 @chapter The Modules of XEmacs | 2060 @chapter The Modules of XEmacs |
4158 @cindex modules of XEmacs | 2061 @cindex modules of XEmacs |
4159 | 2062 |
4160 @menu | 2063 @menu |
4161 * A Summary of the Various XEmacs Modules:: | 2064 * A Summary of the Various XEmacs Modules:: |
5777 @end example | 3680 @end example |
5778 | 3681 |
5779 This module provides some terminal-control code necessary on versions of | 3682 This module provides some terminal-control code necessary on versions of |
5780 AIX prior to 4.1. | 3683 AIX prior to 4.1. |
5781 | 3684 |
5782 | 3685 @node Major Textual Changes, Rules When Writing New C Code, The Modules of XEmacs, Top |
5783 @node Allocation of Objects in XEmacs Lisp, Dumping, The Modules of XEmacs, Top | 3686 @chapter Major Textual Changes |
3687 @cindex textual changes, major | |
3688 @cindex major textual changes | |
3689 | |
3690 Sometimes major textual changes are made to the source. This means that | |
3691 a search-and-replace is done to change type names and such. Some people | |
3692 disagree with such changes, and certainly if done without good reason | |
3693 will just lead to headaches. But it's important to keep the code clean | |
3694 and understable, and consistent naming goes a long way towards this. | |
3695 | |
3696 An example of the right way to do this was the so-called "great integral | |
3697 type renaming". | |
3698 | |
3699 @menu | |
3700 * Great Integral Type Renaming:: | |
3701 * Text/Char Type Renaming:: | |
3702 @end menu | |
3703 | |
3704 @node Great Integral Type Renaming, Text/Char Type Renaming, Major Textual Changes, Major Textual Changes | |
3705 @section Great Integral Type Renaming | |
3706 @cindex Great Integral Type Renaming | |
3707 @cindex integral type renaming, great | |
3708 @cindex type renaming, integral | |
3709 @cindex renaming, integral types | |
3710 | |
3711 The purpose of this is to rationalize the names used for various | |
3712 integral types, so that they match their intended uses and follow | |
3713 consistent conventions, and eliminate types that were not semantically | |
3714 different from each other. | |
3715 | |
3716 The conventions are: | |
3717 | |
3718 @itemize @bullet | |
3719 @item | |
3720 All integral types that measure quantities of anything are signed. Some | |
3721 people disagree vociferously with this, but their arguments are mostly | |
3722 theoretical, and are vastly outweighed by the practical headaches of | |
3723 mixing signed and unsigned values, and more importantly by the far | |
3724 increased likelihood of inadvertent bugs: Because of the broken "viral" | |
3725 nature of unsigned quantities in C (operations involving mixed | |
3726 signed/unsigned are done unsigned, when exactly the opposite is nearly | |
3727 always wanted), even a single error in declaring a quantity unsigned | |
3728 that should be signed, or the even more subtle error of comparing | |
3729 signed and unsigned values and forgetting the necessary cast, can be | |
3730 catastrophic, as comparisons will yield wrong results. -Wsign-compare | |
3731 is turned on specifically to catch this, but this tends to result in a | |
3732 great number of warnings when mixing signed and unsigned, and the casts | |
3733 are annoying. More has been written on this elsewhere. | |
3734 | |
3735 @item | |
3736 All such quantity types just mentioned boil down to EMACS_INT, which is | |
3737 32 bits on 32-bit machines and 64 bits on 64-bit machines. This is | |
3738 guaranteed to be the same size as Lisp objects of type @code{int}, and (as | |
3739 far as I can tell) of size_t (unsigned!) and ssize_t. The only type | |
3740 below that is not an EMACS_INT is Hashcode, which is an unsigned value | |
3741 of the same size as EMACS_INT. | |
3742 | |
3743 @item | |
3744 Type names should be relatively short (no more than 10 characters or | |
3745 so), with the first letter capitalized and no underscores if they can at | |
3746 all be avoided. | |
3747 | |
3748 @item | |
3749 "count" == a zero-based measurement of some quantity. Includes sizes, | |
3750 offsets, and indexes. | |
3751 | |
3752 @item | |
3753 "bpos" == a one-based measurement of a position in a buffer. "Charbpos" | |
3754 and "Bytebpos" count text in the buffer, rather than bytes in memory; | |
3755 thus Bytebpos does not directly correspond to the memory representation. | |
3756 Use "Membpos" for this. | |
3757 | |
3758 @item | |
3759 "Char" refers to internal-format characters, not to the C type "char", | |
3760 which is really a byte. | |
3761 @end itemize | |
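
The ``viral'' behaviour of unsigned quantities described in the first convention can be seen in a few lines of C. The following is only a minimal sketch: the function names are made up for illustration and do not come from the XEmacs sources, and the explicit cast merely spells out the conversion that the compiler applies implicitly when a signed and an unsigned operand of the same width are compared.

```c
/* Hypothetical helper names; nothing here is taken from XEmacs itself. */

/* Mixed comparison.  For `remaining < limit' with an int and an
   unsigned int, C converts the signed operand to unsigned first.
   The cast below only makes that implicit conversion visible. */
int
mixed_less_than (int remaining, unsigned int limit)
{
  return (unsigned int) remaining < limit;
}

/* All-signed comparison: negative quantities compare as negative. */
int
signed_less_than (int remaining, int limit)
{
  return remaining < limit;
}
```

With these definitions, @code{signed_less_than (-1, 10)} is true, while @code{mixed_less_than (-1, 10)} is false, because -1 has become the largest unsigned value before the comparison takes place. Keeping every quantity signed avoids this class of error without needing casts.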
3762 | |
3763 For the actual name changes, see the script below. | |
3764 | |
3765 I ran the following script to do the conversion. (NOTE: This script is | |
3766 idempotent. You can safely run it multiple times and it will not screw | |
3767 up previous results -- in fact, it will do nothing if nothing has | |
3768 changed. Thus, it can be run repeatedly as necessary to handle patches | |
3769 coming in from old workspaces, or old branches.) There are two tags, | |
3770 just before and just after the change: @samp{pre-integral-type-rename} | |
3771 and @samp{post-integral-type-rename}. When merging code from the main | |
3772 trunk into a branch, the best thing to do is first merge up to | |
3773 @samp{pre-integral-type-rename}, then apply the script and associated | |
3774 changes, then merge from @samp{post-integral-type-rename} to the | |
3775 present. (Alternatively, just do the merging in one operation; but you | |
3776 may then have a lot of conflicts needing to be resolved by hand.) | |
3777 | |
3778 Script @samp{fixtypes.sh} follows: | |
3779 | |
3780 @example | |
3781 ----------------------------------- cut ------------------------------------ | |
3782 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]" | |
3783 gr Memory_Count Bytecount $files | |
3784 gr Lstream_Data_Count Bytecount $files | |
3785 gr Element_Count Elemcount $files | |
3786 gr Hash_Code Hashcode $files | |
3787 gr extcount bytecount $files | |
3788 gr bufpos charbpos $files | |
3789 gr bytind bytebpos $files | |
3790 gr memind membpos $files | |
3791 gr bufbyte intbyte $files | |
3792 gr Extcount Bytecount $files | |
3793 gr Bufpos Charbpos $files | |
3794 gr Bytind Bytebpos $files | |
3795 gr Memind Membpos $files | |
3796 gr Bufbyte Intbyte $files | |
3797 gr EXTCOUNT BYTECOUNT $files | |
3798 gr BUFPOS CHARBPOS $files | |
3799 gr BYTIND BYTEBPOS $files | |
3800 gr MEMIND MEMBPOS $files | |
3801 gr BUFBYTE INTBYTE $files | |
3802 gr MEMORY_COUNT BYTECOUNT $files | |
3803 gr LSTREAM_DATA_COUNT BYTECOUNT $files | |
3804 gr ELEMENT_COUNT ELEMCOUNT $files | |
3805 gr HASH_CODE HASHCODE $files | |
3806 ----------------------------------- cut ------------------------------------ | |
3807 @end example | |
3808 | |
3809 The @samp{gr} script, and the scripts it uses, are documented in | |
3810 @file{README.global-renaming}, because if placed in this file they would | |
3811 need to have their @@ characters doubled, meaning you couldn't easily | |
3812 cut and paste from the source. | |
3813 | |
3814 In addition to those programs, I needed to fix up a few other | |
3815 things, particularly relating to the duplicate definitions of | |
3816 types, now that some types merged with others. Specifically: | |
3817 | |
3818 @enumerate | |
3819 @item | |
3820 in @file{lisp.h}, removed duplicate declarations of Bytecount. The changed | |
3821 code should now look like this: (In each code snippet below, the first | |
3822 and last lines are the same as the original, as are all lines outside of | |
3823 those lines. That allows you to locate the section to be replaced, and | |
3824 replace the stuff in that section, verifying that there isn't anything | |
3825 new added that would need to be kept.) | |
3826 | |
3827 @example | |
3828 --------------------------------- snip ------------------------------------- | |
3829 /* Counts of bytes or chars */ | |
3830 typedef EMACS_INT Bytecount; | |
3831 typedef EMACS_INT Charcount; | |
3832 | |
3833 /* Counts of elements */ | |
3834 typedef EMACS_INT Elemcount; | |
3835 | |
3836 /* Hash codes */ | |
3837 typedef unsigned long Hashcode; | |
3838 | |
3839 /* ------------------------ dynamic arrays ------------------- */ | |
3840 --------------------------------- snip ------------------------------------- | |
3841 @end example | |
3842 | |
3843 @item | |
3844 in @file{lstream.h}, removed duplicate declaration of Bytecount. Rewrote the | |
3845 comment about this type. The changed code should now look like this: | |
3846 | |
3847 @example | |
3848 --------------------------------- snip ------------------------------------- | |
3849 #endif | |
3850 | |
3851 /* There have been some arguments over what the type should be that | |
3852 specifies a count of bytes in a data block to be written out or read in, | |
3853 using @code{Lstream_read()}, @code{Lstream_write()}, and related functions. | |
3854 Originally it was long, which worked fine; Martin "corrected" these to | |
3855 size_t and ssize_t on the grounds that this is theoretically cleaner and | |
3856 is in keeping with the C standards. Unfortunately, this practice is | |
3857 horribly error-prone due to design flaws in the way that mixed | |
3858 signed/unsigned arithmetic happens. In fact, by doing this change, | |
3859 Martin introduced a subtle but fatal error that caused the operation of | |
3860 sending large mail messages to the SMTP server under Windows to fail. | |
3861 By putting all values back to be signed, avoiding any signed/unsigned | |
3862 mixing, the bug immediately went away. The type then in use was | |
3863 Lstream_Data_Count, so that it could be reverted cleanly if a vote came to | |
3864 that. Now it is Bytecount. | |
3865 | |
3866 Some earlier comments about why the type must be signed: This MUST BE | |
3867 SIGNED, since it also is used in functions that return the number of | |
3868 bytes actually read to or written from in an operation, and these | |
3869 functions can return -1 to signal error. | |
3870 | |
3871 Note that the standard Unix @code{read()} and @code{write()} functions define the | |
3872 count going in as a size_t, which is UNSIGNED, and the count going | |
3873 out as an ssize_t, which is SIGNED. This is a horrible design | |
3874 flaw. Not only is it highly likely to lead to logic errors when a | |
3875 -1 gets interpreted as a large positive number, but operations are | |
3876 bound to fail in all sorts of horrible ways when a number in the | |
3877 upper-half of the size_t range is passed in -- this number is | |
3878 unrepresentable as an ssize_t, so code that checks to see how many | |
3879 bytes are actually written (which is mandatory if you are dealing | |
3880 with certain types of devices) will get completely screwed up. | |
3881 | |
3882 --ben | |
3883 */ | |
3884 | |
3885 typedef enum lstream_buffering | |
3886 --------------------------------- snip ------------------------------------- | |
3887 @end example | |
3888 | |
3889 @item | |
3890 in @file{dumper.c}, there are four places, all inside of @code{switch()} statements, | |
3891 where XD_BYTECOUNT appears twice as a case tag. In each case, the two | |
3892 case blocks contain identical code, and you should *REMOVE THE SECOND* | |
3893 and leave the first. | |
3894 @end enumerate | |
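
The argument in the @file{lstream.h} comment, that a byte count which can also report an error must be signed, can be illustrated with a small sketch. Everything below is an assumption made for illustration: @code{Sketch_Bytecount}, @code{sketch_read} and the two predicates are hypothetical names, not the real Lstream interface.

```c
#include <stddef.h>

/* Hypothetical signed byte-count type, for this sketch only. */
typedef long Sketch_Bytecount;

/* Toy reader: copies at most `size' bytes from `src' into `dst'.
   Returns the number of bytes copied, or -1 to signal an error. */
Sketch_Bytecount
sketch_read (const char *src, Sketch_Bytecount avail,
             char *dst, Sketch_Bytecount size)
{
  Sketch_Bytecount n, i;

  if (src == NULL || dst == NULL || avail < 0 || size < 0)
    return -1;
  n = avail < size ? avail : size;
  for (i = 0; i < n; i++)
    dst[i] = src[i];
  return n;
}

/* Correct: the result stays signed, so the error is visible. */
int
read_failed_signed (Sketch_Bytecount result)
{
  return result < 0;
}

/* Buggy: once the result has been stored in a size_t, -1 has become
   the largest representable size_t and can only look like data. */
int
looks_like_data_unsigned (size_t result)
{
  return result > 0;
}
```

A caller that keeps the return value in the signed type can test for failure directly. A caller that assigns it to a @code{size_t} sees a very large count instead of an error, which is the kind of failure the comment attributes to mixing @code{size_t} and @code{ssize_t}.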
3895 | |
3896 @node Text/Char Type Renaming, , Great Integral Type Renaming, Major Textual Changes | |
3897 @section Text/Char Type Renaming | |
3898 @cindex Text/Char Type Renaming | |
3899 @cindex type renaming, text/char | |
3900 @cindex renaming, text/char types | |
3901 | |
3902 The purpose of this was | |
3903 | |
3904 @enumerate | |
3905 @item | |
3906 To distinguish between ``charptr'' when it refers to operations on | |
3907 the pointer itself and when it refers to operations on text | |
3908 @item | |
3909 To use consistent naming for everything referring to internal format, i.e. | |
3910 @end enumerate | |
3911 | |
3912 @example | |
3913 Itext == text in internal format | |
3914 Ibyte == a byte in such text | |
3915 Ichar == a char as represented in internal character format | |
3916 @end example | |
3917 | |
3918 Thus e.g. | |
3919 | |
3920 @example | |
3921 set_charptr_emchar -> set_itext_ichar | |
3922 @end example | |
3923 | |
3924 This was done using a script like this: | |
3925 | |
3926 @example | |
3927 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]" | |
3928 gr Intbyte Ibyte $files | |
3929 gr INTBYTE IBYTE $files | |
3930 gr intbyte ibyte $files | |
3931 gr EMCHAR ICHAR $files | |
3932 gr emchar ichar $files | |
3933 gr Emchar Ichar $files | |
3934 gr INC_CHARPTR INC_IBYTEPTR $files | |
3935 gr DEC_CHARPTR DEC_IBYTEPTR $files | |
3936 gr VALIDATE_CHARPTR VALIDATE_IBYTEPTR $files | |
3937 gr valid_charptr valid_ibyteptr $files | |
3938 gr CHARPTR ITEXT $files | |
3939 gr charptr itext $files | |
3940 gr Charptr Itext $files | |
3941 @end example | |
3942 | |
3943 See above for the source to @samp{gr}. | |
3944 | |
3945 As in the integral-types change, there are pre and post tags before and | |
3946 after the change: | |
3947 | |
3948 @example | |
3949 pre-internal-format-textual-renaming | |
3950 post-internal-format-textual-renaming | |
3951 @end example | |
3952 | |
3953 When merging a large branch, follow the same sort of procedure | |
3954 documented above, using these tags -- essentially sync up to the pre | |
3955 tag, then apply the script yourself, then sync from the post tag to the | |
3956 present. You can probably do the same if you don't have a separate | |
3957 workspace, but do have lots of outstanding changes and you'd rather not | |
3958 just merge all the textual changes directly. Use something like this: | |
3959 | |
3960 (WARNING: I'm not a CVS guru; before trying this, or any large operation | |
3961 that might potentially mess things up, @strong{DEFINITELY} make a backup of | |
3962 your existing workspace.) | |
3963 | |
3964 @example | |
3965 cup -r pre-internal-format-textual-renaming | |
3966 <apply script> | |
3967 cup -A -j post-internal-format-textual-renaming -j HEAD | |
3968 @end example | |
3969 | |
3970 This might also work: | |
3971 | |
3972 @example | |
3973 cup -j pre-internal-format-textual-renaming | |
3974 <apply script> | |
3975 cup -j post-internal-format-textual-renaming -j HEAD | |
3976 @end example | |
3977 | |
3978 ben | |
3979 | |
3980 The following is a script to go in the opposite direction: | |
3981 | |
3982 @example | |
3983 files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]" | |
3984 | |
3985 # Evidently Perl considers _ to be a word char ala \b, even though XEmacs | |
3986 # doesn't. We need to be careful here with ibyte/ichar because of words | |
3987 # like Richard, @code{eicharlen()}, multibyte, HIBYTE, etc. | |
3988 | |
3989 gr Ibyte Intbyte $files | |
3990 gr '\bIBYTE' INTBYTE $files | |
3991 gr '\bibyte' intbyte $files | |
3992 gr '\bICHAR' EMCHAR $files | |
3993 gr '\bichar' emchar $files | |
3994 gr '\bIchar' Emchar $files | |
3995 gr '\bIBYTEPTR' CHARPTR $files | |
3996 gr '\bibyteptr' charptr $files | |
3997 gr '\bITEXT' CHARPTR $files | |
3998 gr '\bitext' charptr $files | |
3999 gr '\bItext' CHARPTR $files | |
4000 | |
4001 gr '_IBYTE' _INTBYTE $files | |
4002 gr '_ibyte' _intbyte $files | |
4003 gr '_ICHAR' _EMCHAR $files | |
4004 gr '_ichar' _emchar $files | |
4005 gr '_Ichar' _Emchar $files | |
4006 gr '_IBYTEPTR' _CHARPTR $files | |
4007 gr '_ibyteptr' _charptr $files | |
4008 gr '_ITEXT' _CHARPTR $files | |
4009 gr '_itext' _charptr $files | |
4010 gr '_Itext' _CHARPTR $files | |
4011 @end example | |
4012 | |
4013 @node Rules When Writing New C Code, Regression Testing XEmacs, Major Textual Changes, Top | |
4014 @chapter Rules When Writing New C Code | |
4015 @cindex writing new C code, rules when | |
4016 @cindex C code, rules when writing new | |
4017 @cindex code, rules when writing new C | |
4018 | |
4019 The XEmacs C Code is extremely complex and intricate, and there are many | |
4020 rules that are more or less consistently followed throughout the code. | |
4021 Many of these rules are not obvious, so they are explained here. It is | |
4022 of the utmost importance that you follow them. If you don't, you may | |
4023 get something that appears to work, but which will crash in odd | |
4024 situations, often in code far away from where the actual breakage is. | |
4025 | |
4026 @menu | |
4027 * A Reader's Guide to XEmacs Coding Conventions:: | |
4028 * General Coding Rules:: | |
4029 * Object-Oriented Techniques for C:: | |
4030 * Writing Lisp Primitives:: | |
4031 * Writing Good Comments:: | |
4032 * Adding Global Lisp Variables:: | |
4033 * Writing Macros:: | |
4034 * Proper Use of Unsigned Types:: | |
4035 * Techniques for XEmacs Developers:: | |
4036 @end menu | |
4037 | |
4038 See also @ref{Coding for Mule}. | |
4039 | |
4040 @node A Reader's Guide to XEmacs Coding Conventions, General Coding Rules, Rules When Writing New C Code, Rules When Writing New C Code | |
4041 @section A Reader's Guide to XEmacs Coding Conventions | |
4042 @cindex coding conventions | |
4043 @cindex reader's guide | |
4044 @cindex coding rules, naming | |
4045 | |
4046 Of course the low-level implementation language of XEmacs is C, but much | |
4047 of that uses the Lisp engine to do its work. However, because the code | |
4048 is ``inside'' of the protective containment shell around the ``reactor | |
4049 core,'' you'll see lots of complex ``plumbing'' needed to do the work | |
4050 and ``safety mechanisms,'' whose failure results in a meltdown. This | |
4051 section provides a quick overview (or review) of the various components | |
4052 of the implementation of Lisp objects. | |
4053 | |
4054 Two typographic conventions help to identify C objects that implement | |
4055 Lisp objects. The first is that capitalized identifiers, especially | |
4056 beginning with the letters @samp{Q}, @samp{V}, @samp{F}, and @samp{S}, | |
4057 for C variables and functions, and C macros beginning with the | |
4058 letter @samp{X}, are used to implement Lisp. The second is that where | |
4059 Lisp uses the hyphen @samp{-} in symbol names, the corresponding C | |
4060 identifiers use the underscore @samp{_}. Of course, since XEmacs Lisp | |
4061 contains interfaces to many external libraries, those external names | |
4062 will follow the coding conventions their authors chose, and may overlap | |
4063 the ``XEmacs name space.'' However these cases are usually pretty | |
4064 obvious. | |
4065 | |
4066 All Lisp objects are handled indirectly. The @code{Lisp_Object} | |
4067 type is usually a pointer to a structure, except for a very small number | |
4068 of types with immediate representations (currently characters and | |
4069 integers). However, these types cannot be directly operated on in C | |
4070 code, either, so they can also be considered indirect. Types that do | |
4071 not have an immediate representation always have a C typedef | |
4072 @code{Lisp_@var{type}} for a corresponding structure. | |
4073 @c #### mention l(c)records here? | |
4074 | |
4075 In older code, it was common practice to pass around pointers to | |
4076 @code{Lisp_@var{type}}, but this is now deprecated in favor of using | |
4077 @code{Lisp_Object} for all function arguments and return values that are | |
4078 Lisp objects. The @code{X@var{type}} macro is used to extract the | |
4079 pointer and cast it to @code{(Lisp_@var{type} *)} for the desired type. | |
4080 | |
4081 @strong{Convention}: macros whose names begin with @samp{X} operate on | |
4082 @code{Lisp_Object}s and do no type-checking. Many such macros are type | |
4083 extractors, but others implement Lisp operations in C (@emph{e.g.}, | |
4084 @code{XCAR} implements the Lisp @code{car} function). These are unsafe, | |
4085 and must only be used where types of all data have already been checked. | |
4086 Such macros are only applied to @code{Lisp_Object}s. In internal | |
4087 implementations where the pointer has already been converted, the | |
4088 structure is operated on directly using the C @code{->} member access | |
4089 operator. | |
4090 | |
4091 The @code{@var{type}P}, @code{CHECK_@var{type}}, and | |
4092 @code{CONCHECK_@var{type}} macros are used to test types. The first | |
4093 returns a Boolean value, and the latter two signal errors. (The | |
4094 @samp{CONCHECK} variety allows execution to be CONtinued under some | |
4095 circumstances, thus the name.) Functions which expect to be passed user | |
4096 data invariably call @samp{CHECK} macros on arguments. | |
4097 | |
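The relationship between the predicate, the extractor and the check can be sketched with a toy model. Everything below (@code{Toy_Object}, @code{Toy_Cons}, @code{TOY_CONSP}, @code{XTOY_CONS}, @code{CHECK_TOY_CONS}, @code{toy_car}) is a hypothetical stand-in invented for illustration; the real @code{Lisp_Object} representation and the real macros in @file{lisp.h} are different, and the real @samp{CHECK} macros signal a Lisp error rather than exiting.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy model only: XEmacs's real Lisp_Object layout is different. */
enum toy_type { TOY_CONS, TOY_STRING };

struct toy_header { enum toy_type type; };

typedef struct Toy_Cons
{
  struct toy_header header;	/* must be the first member */
  void *car, *cdr;
} Toy_Cons;

/* Stand-in for Lisp_Object: an opaque pointer to a typed header. */
typedef struct toy_header *Toy_Object;

/* <type>P style: type predicate; yields a C int, not a Lisp value. */
#define TOY_CONSP(obj) ((obj)->type == TOY_CONS)

/* X<type> style: extractor; performs no type check, so the caller
   must already have verified the type. */
#define XTOY_CONS(obj) ((Toy_Cons *) (obj))

/* CHECK_<type> style: complain and bail out if the type is wrong.
   (Assumption: exiting stands in for signalling a Lisp error.) */
#define CHECK_TOY_CONS(obj) do {			\
  if (!TOY_CONSP (obj))					\
    {							\
      fprintf (stderr, "wrong type argument: consp\n");	\
      exit (1);						\
    }							\
} while (0)

/* Typical shape of a function taking user data: check, then extract. */
static void *
toy_car (Toy_Object obj)
{
  CHECK_TOY_CONS (obj);
  return XTOY_CONS (obj)->car;
}
```

The point of the split is that the unchecked extractor stays cheap, and the check is paid once at the boundary where untrusted data enters.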
4098 There are many types of specialized Lisp objects implemented in C, but | |
4099 the most pervasive type is the @dfn{symbol}. Symbols are used as | |
4100 identifiers, variables, and functions. | |
4101 | |
4102 @strong{Convention}: Global variables whose names begin with @samp{Q} | |
4103 are constants whose value is a symbol. The name of the variable should | |
4104 be derived from the name of the symbol using the same rules as for Lisp | |
4105 primitives. Such variables allow the C code to check whether a | |
4106 particular @code{Lisp_Object} is equal to a given symbol. Symbols are | |
4107 Lisp objects, so these variables may be passed to Lisp primitives. (An | |
4108 alternative to the use of @samp{Q...} variables is to call the | |
4109 @code{intern} function at initialization in the | |
4110 @code{vars_of_@var{module}} function, which is hardly less efficient.) | |
4111 | |
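Why a @samp{Q...} variable makes symbol tests cheap can be seen in a small model. The names @code{toy_symbol}, @code{toy_intern}, @code{syms_of_toy} and @code{is_error_symbol} are hypothetical, and the linear table is only a placeholder for a real obarray; this is a sketch of the idea, not of the XEmacs implementation.

```c
#include <string.h>

struct toy_symbol { const char *name; };

#define TOY_MAX_SYMBOLS 16
static struct toy_symbol toy_obarray[TOY_MAX_SYMBOLS];
static int toy_nsymbols;

/* Return the unique symbol named NAME, creating it on first use.
   Returns NULL if the toy table is full. */
static struct toy_symbol *
toy_intern (const char *name)
{
  int i;
  for (i = 0; i < toy_nsymbols; i++)
    if (strcmp (toy_obarray[i].name, name) == 0)
      return &toy_obarray[i];
  if (toy_nsymbols == TOY_MAX_SYMBOLS)
    return NULL;
  toy_obarray[toy_nsymbols].name = name;
  return &toy_obarray[toy_nsymbols++];
}

/* Q... style variable: assigned once, at initialization. */
static struct toy_symbol *Qerror;

static void
syms_of_toy (void)
{
  Qerror = toy_intern ("error");
}

/* Later tests are a pointer comparison, with no string lookup. */
static int
is_error_symbol (struct toy_symbol *sym)
{
  return sym == Qerror;
}
```

Because interning returns the same object for the same name, comparing against the cached variable is equivalent to comparing names.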
4112 @strong{Convention}: Global variables whose names begin with @samp{V} | |
4113 are variables that contain Lisp objects. The convention here is that | |
4114 all global variables of type @code{Lisp_Object} begin with @samp{V}, and | |
4115 no others do (not even integer and boolean variables that have Lisp | |
4116 equivalents). Most of the time, these variables have equivalents in | |
4117 Lisp, which are defined via the @samp{DEFVAR} family of macros, but some | |
4118 don't. Since the variable's value is a @code{Lisp_Object}, it can be | |
4119 passed to Lisp primitives. | |
4120 | |
4121 The implementation of Lisp primitives is more complex. | |
4122 @strong{Convention}: Global variables with names beginning with @samp{S} | |
4123 contain a structure that allows the Lisp engine to identify and call a C | |
4124 function. In modern versions of XEmacs, these identifiers are almost | |
4125 always completely hidden in the @code{DEFUN} and @code{DEFSUBR} macros, but | |
4126 you will encounter them if you look at very old versions of XEmacs or at | |
4127 GNU Emacs. @strong{Convention}: Functions with names beginning with | |
4128 @samp{F} implement Lisp primitives. Of course all their arguments and | |
4129 their return values must be @code{Lisp_Object}s. (This is hidden in the | |
4130 @code{DEFUN} macro.) | |
4131 | |
4132 | |
4133 @node General Coding Rules, Object-Oriented Techniques for C, A Reader's Guide to XEmacs Coding Conventions, Rules When Writing New C Code | |
4134 @section General Coding Rules | |
4135 @cindex coding rules, general | |
4136 | |
4137 The C code is actually written in a dialect of C called @dfn{Clean C}, | |
4138 meaning that it can be compiled, mostly warning-free, with either a C or | |
4139 C++ compiler. Coding in Clean C has several advantages over plain C. | |
4140 C++ compilers are more nit-picking, and a number of coding errors have | |
4141 been found by compiling with C++. The ability to use both C and C++ | |
4142 tools means that a greater variety of development tools are available to | |
4143 the developer. In addition, the ability to overload operators in C++ | |
4144 means it is possible, for error-checking purposes, to redefine certain | |
4145 simple types (normally defined as aliases for simple built-in types such | |
4146 as @code{unsigned char} or @code{long}) as classes, strictly limiting the permissible | |
4147 operations and catching illegal implicit conversions and such. | |
4148 | |
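One well-known place where C and C++ disagree is the implicit conversion from @code{void *}. The fragment below is offered only as an illustration of code that both kinds of compiler accept; @code{copy_string} is a hypothetical helper, and this is not a list of what Clean C requires.

```c
#include <stdlib.h>
#include <string.h>

/* Valid C, but rejected by a C++ compiler, because void * does not
   convert implicitly to char * there:
     char *copy = malloc (len + 1);
   The explicit cast below is accepted by both. */
static char *
copy_string (const char *src)
{
  size_t len = strlen (src);
  char *copy = (char *) malloc (len + 1);
  if (copy)
    memcpy (copy, src, len + 1);	/* includes the trailing NUL */
  return copy;
}
```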
4149 Every module includes @file{<config.h>} (angle brackets so that | |
4150 @samp{--srcdir} works correctly; @file{config.h} may or may not be in | |
4151 the same directory as the C sources) and @file{lisp.h}. @file{config.h} | |
4152 must always be included before any other header files (including | |
4153 system header files) to ensure that certain tricks played by various | |
4154 @file{s/} and @file{m/} files work out correctly. | |
4155 | |
4156 When including header files, always use angle brackets, not double | |
4157 quotes, except when the file to be included is always in the same | |
4158 directory as the including file. If either file is a generated file, | |
4159 then that is not likely to be the case. In order to understand why we | |
4160 have this rule, imagine what happens when you do a build in the source | |
4161 directory using @samp{./configure} and another build in another | |
4162 directory using @samp{../work/configure}. There will be two different | |
4163 @file{config.h} files. Which one will be used if you @samp{#include | |
4164 "config.h"}? | |
4165 | |
4166 Almost every module contains a @code{syms_of_*()} function and a | |
4167 @code{vars_of_*()} function. The former declares any Lisp primitives | |
4168 you have defined and defines any symbols you will be using. The latter | |
4169 declares any global Lisp variables you have added and initializes global | |
4170 C variables in the module. @strong{Important}: There are stringent | |
4171 requirements on exactly what can go into these functions. See the | |
4172 comment in @file{emacs.c}. The reason for this is to avoid obscure | |
4173 unwanted interactions during initialization. If you don't follow these | |
4174 rules, you'll be sorry! If you want to do anything that isn't allowed, | |
4175 create a @code{complex_vars_of_*()} function for it. Doing this is | |
4176 tricky, though: you have to make sure your function is called at the | |
4177 right time so that all the initialization dependencies work out. | |
4178 | |
4179 Declare each function of these kinds in @file{symsinit.h}. Make sure | |
4180 it's called in the appropriate place in @file{emacs.c}. You never need | |
4181 to include @file{symsinit.h} directly, because it is included by | |
4182 @file{lisp.h}. | |
4183 | |
4184 @strong{All global and static variables that are to be modifiable must | |
4185 be declared uninitialized.} This means that you may not use the | |
4186 ``declare with initializer'' form for these variables, such as @code{int | |
4187 some_variable = 0;}. The reason for this has to do with some kludges | |
4188 done during the dumping process: If possible, the initialized data | |
4189 segment is re-mapped so that it becomes part of the (unmodifiable) code | |
4190 segment in the dumped executable. This allows this memory to be shared | |
4191 among multiple running XEmacs processes. XEmacs is careful to place as | |
4192 much constant data as possible into initialized variables during the | |
4193 @file{temacs} phase. | |
4194 | |
4195 @cindex copy-on-write | |
4196 @strong{Please note:} This kludge only works on a few systems nowadays, | |
4197 and is rapidly becoming irrelevant because most modern operating systems | |
4198 provide @dfn{copy-on-write} semantics. All data is initially shared | |
4199 between processes, and a private copy is automatically made (on a | |
4200 page-by-page basis) when a process first attempts to write to a page of | |
4201 memory. | |
4202 | |
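In practice the rule about uninitialized declarations means moving the initial value out of the declaration and into an initialization function. A minimal sketch, assuming a hypothetical module named @samp{example} (the variable and function names are invented, and the real restrictions on @code{vars_of_*()} functions are those described in @file{emacs.c}):

```c
/* Declared without an initializer, as the rule requires.
   Not:  static int example_counter = 0;  */
static int example_counter;
static const char *example_name;

/* Hypothetical module initializer, following the vars_of_*() naming
   pattern; the values are assigned at run time instead. */
void
vars_of_example (void)
{
  example_counter = 0;
  example_name = "example";
}

int
example_next (void)
{
  return ++example_counter;
}
```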
4203 Formerly, there was a requirement that static variables not be declared | |
4204 inside of functions. This had to do with another hack in the same vein | |
4205 as what was just described: old USG systems put statically-declared | |
4206 variables in the initialized data space, so the header files for those | |
4207 systems had a @code{#define static} declaration. (That way, the data-segment remapping | |
4208 described above could still work.) This fails badly on static variables | |
4209 inside of functions, which suddenly become automatic variables; | |
4210 therefore, you weren't supposed to have any of them. This awful kludge | |
4211 has been removed in XEmacs because | |
4212 | |
4213 @enumerate | |
4214 @item | |
4215 almost all of the systems that used this kludge ended up having | |
4216 to disable the data-segment remapping anyway; | |
4217 @item | |
4218 the only systems that didn't were extremely outdated ones; | |
4219 @item | |
4220 this hack completely messed up inline functions. | |
4221 @end enumerate | |
4222 | |
4223 The C source code makes heavy use of C preprocessor macros. One popular | |
4224 macro style is: | |
4225 | |
4226 @example | |
4227 #define FOO(var, value) do @{ \ | |
4228 Lisp_Object FOO_value = (value); \ | |
4229 ... /* compute using FOO_value */ \ | |
4230 (var) = bar; \ | |
4231 @} while (0) | |
4232 @end example | |
4233 | |
4234 The @code{do @{...@} while (0)} is a standard trick to allow @code{FOO} to have | |
4235 statement semantics, so that it can safely be used within an @code{if} | |
4236 statement in C, for example. Multiple evaluation is prevented by | |
4237 copying a supplied argument into a local variable, so that | |
4238 @code{FOO(var,fun(1))} only calls @code{fun} once. | |
4239 | |
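Both properties of this macro style can be checked with a complete, compilable instance. @code{SET_DOUBLE}, @code{next_value} and @code{demo} are hypothetical names chosen for this sketch; an @code{int} is used in place of @code{Lisp_Object} so that the fragment needs no XEmacs headers.

```c
static int call_count;

static int
next_value (void)
{
  return ++call_count;
}

/* Stores twice VALUE in VAR.  VALUE is copied into a local first, so
   it is evaluated exactly once even though it is used twice. */
#define SET_DOUBLE(var, value) do {			\
  int SET_DOUBLE_value = (value);			\
  (var) = SET_DOUBLE_value + SET_DOUBLE_value;		\
} while (0)

static int
demo (int flag)
{
  int result = -1;
  /* With a bare { ... } block instead of do ... while (0), the `;'
     after the macro call would end the if statement and orphan the
     else. */
  if (flag)
    SET_DOUBLE (result, next_value ());
  else
    result = 0;
  return result;
}
```

Prefixing the local with the macro's own name, as in @code{SET_DOUBLE_value}, reduces the chance of colliding with a variable mentioned in the caller's argument.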
4240 Lisp lists are popular data structures in the C code as well as in | |
4241 Elisp. There are two sets of macros that iterate over lists. | |
4242 @code{EXTERNAL_LIST_LOOP_@var{n}} should be used when the list has been | |
4243 supplied by the user, and cannot be trusted to be acyclic and | |
4244 @code{nil}-terminated. A @code{malformed-list} or @code{circular-list} error | |
4245 will be generated if the list being iterated over is not entirely | |
4246 kosher. @code{LIST_LOOP_@var{n}}, on the other hand, is faster and less | |
4247 safe, and can be used only on trusted lists. | |
4248 | |
4249 Related macros are @code{GET_EXTERNAL_LIST_LENGTH} and | |
4250 @code{GET_LIST_LENGTH}, which calculate the length of a list; | |
4251 @code{GET_EXTERNAL_LIST_LENGTH} also validates the properness of | |
4252 the list. The macros @code{EXTERNAL_LIST_LOOP_DELETE_IF} and | |
4253 @code{LIST_LOOP_DELETE_IF} delete the elements of a Lisp list that satisfy some | |
4254 predicate. | |
4255 | |
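The difference between the trusted and untrusted variants can be illustrated on a toy cons list. The sketch below uses the generic tortoise-and-hare technique for detecting a circular list; it is an assumption that this resembles what the @samp{EXTERNAL} macros do, and the names @code{toy_cons}, @code{trusted_length} and @code{checked_length} are invented here. The real macros signal a Lisp error instead of returning a sentinel.

```c
#include <stddef.h>

/* Toy cons cell; a NULL cdr plays the role of nil. */
struct toy_cons
{
  int car;
  struct toy_cons *cdr;
};

/* Trusted variant: assumes the list is NULL-terminated and acyclic,
   and would loop forever on a circular list. */
static long
trusted_length (const struct toy_cons *list)
{
  long len = 0;
  for (; list; list = list->cdr)
    len++;
  return len;
}

/* Untrusted variant: returns -1 if the list is circular.  A second
   "tortoise" pointer advances at half speed; in a circular list the
   fast pointer eventually lands on it. */
static long
checked_length (const struct toy_cons *list)
{
  const struct toy_cons *hare = list, *tortoise = list;
  long len = 0;
  while (hare)
    {
      hare = hare->cdr;
      len++;
      if (len % 2 == 0)
	{
	  tortoise = tortoise->cdr;
	  if (hare && hare == tortoise)
	    return -1;
	}
    }
  return len;
}
```

The extra pointer and comparison are the cost that the faster, trusted loops avoid.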
4256 @node Object-Oriented Techniques for C, Writing Lisp Primitives, General Coding Rules, Rules When Writing New C Code | |
4257 @section Object-Oriented Techniques for C | |
4258 @cindex coding rules, object-oriented | |
4259 @cindex object-oriented techniques | |
4260 | |
4261 At the lowest levels, XEmacs makes heavy use of object-oriented | |
4262 techniques to promote code-sharing and uniform interfaces for different | |
4263 devices and platforms. Commonly, but not always, such objects are | |
4264 ``wrapped'' and exported to Lisp as Lisp objects. Usually they use | |
4265 the internal structures developed for Lisp objects (the @samp{lrecord} | |
4266 structure) in order to take advantage of Lisp memory management. | |
4267 Unfortunately, XEmacs was originally written in C, so these techniques | |
4268 are based on heavy use of C macros. | |
4269 | |
4270 @c You can't use @var{} for type below, because case is important. | |
4271 A module defining a class is likely to use most of the following | |
4272 declarations and macros. In the following, the notation @samp{<type>} | |
4273 will stand for the full name of the class, and will be capitalized in | |
4274 the way normal for its context. The notation @samp{<typ>} will stand | |
4275 for the abbreviated form commonly used in macro names, while @samp{ty} | |
4276 will be used as the typical name for instances of the class. (See the | |
4277 entry for @samp{MAYBE_<TYP>METH} below for an example using all three | |
4278 notations.) | |
4279 | |
4280 In the interface (@file{.h} file), the following declarations are used | |
4281 often. Others may be used in particular modules. Since they're | |
4282 quite short in most cases, the definitions are given as well. The | |
4283 generic macros used are defined in @file{lisp.h} or @file{lrecord.h}. | |
4284 | |
4285 @c #### reorganize this table into stuff used in general code, and stuff | |
4286 @c used only in declarations or initializations | |
4287 @table @samp | |
4288 @c #### declaration | |
4289 @item typedef struct Lisp_<Type> Lisp_<Type> | |
4290 This refers to the internal structure used by C code. The XEmacs coding | |
4291 style now forbids passing pointers to @samp{Lisp_<Type>} structures into | |
4292 or out of a function; instead, a @samp{Lisp_Object} should be passed or | |
4293 returned (created using @samp{wrap_<type>}, if necessary). | |
4294 | |
4295 @c #### declaration | |
4296 @item DECLARE_LRECORD (<type>, Lisp_<Type>) | |
4297 Declares an @samp{lrecord} for @samp{<Type>}, which is the unit of | |
4298 allocation. | |
4299 | |
4300 @item #define X<TYPE>(x) XRECORD (x, <type>, Lisp_<Type>) | |
4301 Turns a @code{Lisp_Object} into a pointer to @samp{struct Lisp_<Type>}. | |
4302 | |
4303 @item #define wrap_<type>(p) wrap_record (p, <type>) | |
4304 Turns a pointer to @samp{struct Lisp_<Type>} into a @code{Lisp_Object}. | |
4305 | |
4306 @item #define <TYPE>P(x) RECORDP (x, <type>) | |
4307 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>}. | |
4308 Returns a C int, not a Lisp Boolean value. | |
4309 | |
4310 @item #define CHECK_<TYPE>(x) CHECK_RECORD (x, <type>) | |
4311 @itemx #define CONCHECK_<TYPE>(x) CONCHECK_RECORD (x, <type>) | |
4312 Tests whether a given @code{Lisp_Object} is of type @samp{Lisp_<Type>}, | |
4313 and signals a Lisp error if not. The @samp{CHECK} version of the macro | |
4314 never returns if the type is wrong, while the @samp{CONCHECK} version | |
4315 can return if the user catches it in the debugger and explicitly | |
4316 requests a return. | |
4317 | |
4318 @item #define RAW_<TYP>METH(ty, m) ((ty)->methods->m##_method) | |
4319 Return a function pointer for the method for an object @var{TY} of class | |
4320 @samp{Lisp_<Type>}, or @samp{NULL} if there is none for this type. | |
4321 | |
4322 @item #define HAS_<TYP>METH_P(ty, m) (!!RAW_<TYP>METH (ty, m)) | |
4323 Test whether the class that @var{TY} is an instance of has the method. | |
4324 | |
4325 @item #define <TYP>METH(ty, m, args) ((RAW_<TYP>METH (ty, m)) args) | |
4326 Call the method on @samp{args}. @samp{args} must be enclosed in | |
4327 parentheses in the call. It is the programmer's responsibility to | |
4328 ensure that the method is available. The standard convenience macro | |
4329 @samp{MAYBE_<TYP>METH} is often provided for the common case where a | |
4330 void-returning method of @samp{<Type>} is called. | |
4331 | |
4332 @item #define MAYBE_<TYP>METH(ty, m, args) do @{ ... @} while (0) | |
4333 Call a void-returning @samp{<Type>} method, if it exists. Note the use | |
4334 of the @samp{do ... while (0)} idiom to give the macro call C statement | |
4335 semantics. The full definition is equally idiomatic: | |
4336 | |
4337 @example | |
4338 #define MAYBE_<TYP>METH(ty, m, args) do @{ \ | |
4339 Lisp_<Type> *maybe_<typ>meth_ty = (ty); \ | |
4340 if (HAS_<TYP>METH_P (maybe_<typ>meth_ty, m)) \ | |
4341 <TYP>METH (maybe_<typ>meth_ty, m, args); \ | |
4342 @} while (0) | |
4343 @end example | |
4344 @end table | |
4345 | |
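To see the method macros from the table working together, here is a self-contained instance for a made-up class. The class @samp{shape}, its abbreviation @samp{SHAPE}, the methods @samp{area} and @samp{reset}, and the subtype @samp{rect} are all hypothetical and do not exist in XEmacs; plain structures are used instead of @samp{lrecord}s, so only the dispatch pattern is shown.

```c
#include <stddef.h>

struct shape;

/* Method table shared by all instances of one subtype. */
struct shape_methods
{
  int (*area_method) (struct shape *s);
  void (*reset_method) (struct shape *s);	/* may be NULL */
};

struct shape
{
  const struct shape_methods *methods;
  int width, height;
};

#define RAW_SHAPEMETH(s, m) ((s)->methods->m##_method)
#define HAS_SHAPEMETH_P(s, m) (!!RAW_SHAPEMETH (s, m))
#define SHAPEMETH(s, m, args) ((RAW_SHAPEMETH (s, m)) args)
#define MAYBE_SHAPEMETH(s, m, args) do {		\
  struct shape *maybe_shapemeth_s = (s);		\
  if (HAS_SHAPEMETH_P (maybe_shapemeth_s, m))		\
    SHAPEMETH (maybe_shapemeth_s, m, args);		\
} while (0)

/* A subtype's definition is named <subtype>_<method>, while callers
   mention only <method>. */
static int
rect_area (struct shape *s)
{
  return s->width * s->height;
}

/* rect provides no reset method. */
static const struct shape_methods rect_methods = { rect_area, NULL };
```

A caller would write @code{SHAPEMETH (s, area, (s))}; note the parenthesized argument list, which is passed through the macro as a single argument.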
4346 The use of macros for invoking an object's methods makes life a bit | |
4347 difficult for the student or maintainer when browsing the code. In | |
4348 particular, calls are of the form @samp{<TYP>METH (ty, some_method, (x, | |
4349 y))}, but definitions typically are for @samp{<subtype>_some_method}. | |
4350 Thus, when you are trying to find calls, you need to grep for | |
4351 @samp{some_method}, but this will also catch calls and definitions of | |
4352 that method for instances of other subtypes of @samp{<Type>}, and there | |
4353 may be a rather large number of them. | |
4354 | |
4355 | |
4356 @node Writing Lisp Primitives, Writing Good Comments, Object-Oriented Techniques for C, Rules When Writing New C Code | |
4357 @section Writing Lisp Primitives | |
4358 @cindex writing Lisp primitives | |
4359 @cindex Lisp primitives, writing | |
4360 @cindex primitives, writing Lisp | |
4361 | |
4362 Lisp primitives are Lisp functions implemented in C. The details of | |
4363 interfacing the C function so that Lisp can call it are handled by a few | |
4364 C macros. The only way to really understand how to write new C code is | |
4365 to read the source, but we can explain some things here. | |
4366 | |
4367 An example of a special form is the definition of @code{prog1}, from | |
4368 @file{eval.c}. (An ordinary function would have the same general | |
4369 appearance.) | |
4370 | |
4371 @cindex garbage collection protection | |
4372 @smallexample | |
4373 @group | |
4374 DEFUN ("prog1", Fprog1, 1, UNEVALLED, 0, /* | |
4375 Similar to `progn', but the value of the first form is returned. | |
4376 \(prog1 FIRST BODY...): All the arguments are evaluated sequentially. | |
4377 The value of FIRST is saved during evaluation of the remaining args, | |
4378 whose values are discarded. | |
4379 */ | |
4380 (args)) | |
4381 @{ | |
4382 /* This function can GC */ | |
4383 REGISTER Lisp_Object val, form, tail; | |
4384 struct gcpro gcpro1; | |
4385 | |
4386 val = Feval (XCAR (args)); | |
4387 | |
4388 GCPRO1 (val); | |
4389 | |
4390 LIST_LOOP_3 (form, XCDR (args), tail) | |
4391 Feval (form); | |
4392 | |
4393 UNGCPRO; | |
4394 return val; | |
4395 @} | |
4396 @end group | |
4397 @end smallexample | |
4398 | |
4399 Let's start with a precise explanation of the arguments to the | |
4400 @code{DEFUN} macro. Here is a template for them: | |
4401 | |
4402 @example | |
4403 @group | |
4404 DEFUN (@var{lname}, @var{fname}, @var{min_args}, @var{max_args}, @var{interactive}, /* | |
4405 @var{docstring} | |
4406 */ | |
4407 (@var{arglist})) | |
4408 @end group | |
4409 @end example | |
4410 | |
4411 @table @var | |
4412 @item lname | |
4413 This string is the name of the Lisp symbol to define as the function | |
4414 name; in the example above, it is @code{"prog1"}. | |
4415 | |
4416 @item fname | |
4417 This is the C function name for this function. This is the name that is | |
4418 used in C code for calling the function. The name is, by convention, | |
4419 @samp{F} prepended to the Lisp name, with all dashes (@samp{-}) in the | |
4420 Lisp name changed to underscores. Thus, to call this function from C | |
4421 code, call @code{Fprog1}. Remember that the arguments are of type | |
4422 @code{Lisp_Object}; various macros and functions for creating values of | |
4423 type @code{Lisp_Object} are declared in the file @file{lisp.h}. | |
4424 | |
4425 Primitives whose names are special characters (e.g. @code{+} or | |
4426 @code{<}) are named by spelling out, in some fashion, the special | |
4427 character: e.g. @code{Fplus()} or @code{Flss()}. Primitives whose names | |
4428 begin with normal alphanumeric characters but also contain special | |
4429 characters are spelled out in some creative way, e.g. @code{let*} | |
4430 becomes @code{FletX()}. | |
4431 | |
4432 Each function also has an associated structure that holds the data for | |
4433 the subr object that represents the function in Lisp. This structure | |
4434 conveys the Lisp symbol name to the initialization routine that will | |
4435 create the symbol and store the subr object as its definition. The C | |
4436 variable name of this structure is always @samp{S} prepended to the | |
4437 @var{fname}. You hardly ever need to be aware of the existence of this | |
4438 structure, since @code{DEFUN} plus @code{DEFSUBR} takes care of all the | |
4439 details. | |
4440 | |
4441 @item min_args | |
4442 This is the minimum number of arguments that the function requires. The | |
4443 function @code{prog1} allows a minimum of one argument. | |
4444 | |
4445 @item max_args | |
4446 This is the maximum number of arguments that the function accepts, if | |
4447 there is a fixed maximum. Alternatively, it can be @code{UNEVALLED}, | |
4448 indicating a special form that receives unevaluated arguments, or | |
4449 @code{MANY}, indicating an unlimited number of evaluated arguments (the | |
4450 C equivalent of @code{&rest}). Both @code{UNEVALLED} and @code{MANY} | |
4451 are macros. If @var{max_args} is a number, it may not be less than | |
4452 @var{min_args} and it may not be greater than 8. (If you need to add a | |
4453 function with more than 8 arguments, use the @code{MANY} form. Resist | |
4454 the urge to edit the definition of @code{DEFUN} in @file{lisp.h}. If | |
4455 you do it anyway, make sure to also add another clause to the switch | |
4456 statement in @code{primitive_funcall()}.) | |
4457 | |
4458 @item interactive | |
4459 This is an interactive specification, a string such as might be used as | |
4460 the argument of @code{interactive} in a Lisp function. In the case of | |
4461 @code{prog1}, it is 0 (a null pointer), indicating that @code{prog1} | |
4462 cannot be called interactively. A value of @code{""} indicates a | |
4463 function that should receive no arguments when called interactively. | |
4464 | |
4465 @item docstring | |
4466 This is the documentation string. It is written just like a | |
4467 documentation string for a function defined in Lisp; in particular, the | |
4468 first line should be a single sentence. Note how the documentation | |
4469 string is enclosed in a comment, none of the documentation is placed on | |
4470 the same lines as the comment-start and comment-end characters, and the | |
4471 comment-start characters are on the same line as the interactive | |
4472 specification. @file{make-docfile}, which scans the C files for | |
4473 documentation strings, is very particular about what it looks for, and | |
4474 will not properly extract the doc string if it's not in this exact format. | |
4475 | |
4476 In order to make both @file{etags} and @file{make-docfile} happy, make | |
4477 sure that the @code{DEFUN} line contains the @var{lname} and | |
4478 @var{fname}, and that the comment-start characters for the doc string | |
4479 are on the same line as the interactive specification, and put a newline | |
4480 directly after them (and before the comment-end characters). | |
4481 | |
4482 @item arglist | |
4483 This is the comma-separated list of arguments to the C function. For a | |
4484 function with a fixed maximum number of arguments, provide a C argument | |
4485 for each Lisp argument. In this case, unlike regular C functions, the | |
4486 types of the arguments are not declared; they are simply always of type | |
4487 @code{Lisp_Object}. | |
4488 | |
4489 The names of the C arguments will be used as the names of the arguments | |
4490 to the Lisp primitive as displayed in its documentation, modulo the same | |
4491 concerns described above for @code{F...} names (in particular, | |
4492 underscores in the C arguments become dashes in the Lisp arguments). | |
4493 | |
4494 There is one additional kludge: A trailing @samp{_} on the C argument is | |
4495 discarded when forming the Lisp argument. This allows C language | |
4496 reserved words (like @code{default}) or global symbols (like | |
4497 @code{dirname}) to be used as argument names without compiler warnings | |
4498 or errors. | |
4499 | |
4500 A Lisp function with @w{@var{max_args} = @code{UNEVALLED}} is a | |
4501 @w{@dfn{special form}}; its arguments are not evaluated. Instead it | |
4502 receives one argument of type @code{Lisp_Object}, a (Lisp) list of the | |
4503 unevaluated arguments, conventionally named @code{(args)}. | |
4504 | |
4505 When a Lisp function has no upper limit on the number of arguments, | |
4506 specify @w{@var{max_args} = @code{MANY}}. In this case its implementation in | |
4507 C actually receives exactly two arguments: the number of Lisp arguments | |
4508 (an @code{int}) and the address of a block containing their values (a | |
4509 @w{@code{Lisp_Object *}}). In this case only are the C types specified | |
4510 in the @var{arglist}: @w{@code{(int nargs, Lisp_Object *args)}}. | |
4511 | |
4512 @end table | |
4513 | |
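The calling convention for @code{MANY} can be shown without any XEmacs headers. In this sketch @code{Toy_Object} is a plain @code{long} standing in for @code{Lisp_Object}, and @code{toy_plus} is a hypothetical function, not the real implementation of any primitive; only the @code{(int nargs, ... *args)} shape is taken from the text above.

```c
/* Stand-in for Lisp_Object; the real type is not a plain long. */
typedef long Toy_Object;

/* Shape of the C function behind a primitive declared with
   max_args = MANY: it receives the argument count and a pointer to
   a block holding the already-evaluated values. */
static Toy_Object
toy_plus (int nargs, Toy_Object *args)
{
  Toy_Object sum = 0;
  int i;
  for (i = 0; i < nargs; i++)
    sum += args[i];
  return sum;
}
```

A function written this way must cope with any count at or above its declared minimum, including zero when the minimum is zero.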
4514 Within the function @code{Fprog1} itself, note the use of the macros | |
4515 @code{GCPRO1} and @code{UNGCPRO}. @code{GCPRO1} is used to ``protect'' | |
4516 a variable from garbage collection---to inform the garbage collector | |
4517 that it must look in that variable and regard the object pointed at by | |
4518 its contents as an accessible object. This is necessary whenever you | |
4519 call @code{Feval} or anything that can directly or indirectly call | |
4520 @code{Feval} (this includes the @code{QUIT} macro!). At such a time, | |
4521 any Lisp object that you intend to refer to again must be protected | |
4522 somehow. @code{UNGCPRO} cancels the protection of the variables that | |
4523 are protected in the current function. It is necessary to do this | |
4524 explicitly. | |
4525 | |
4526 The macro @code{GCPRO1} protects just one local variable. If you want | |
4527 to protect two, use @code{GCPRO2} instead; repeating @code{GCPRO1} will | |
4528 not work. Macros @code{GCPRO3} and @code{GCPRO4} also exist. | |
4529 | |
4530 These macros implicitly use local variables such as @code{gcpro1}; you | |
4531 must declare these explicitly, with type @code{struct gcpro}. Thus, if | |
4532 you use @code{GCPRO2}, you must declare @code{gcpro1} and @code{gcpro2}. | |
4533 | |
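The need for explicitly declared @code{gcpro1}, @code{gcpro2} variables is easier to see in a toy model in which each protected variable gets a record chained onto a global list. This is an assumption about the general shape of such a scheme, written for illustration; the real definitions are in @file{lisp.h} and may differ, and every @samp{toy_}/@samp{TOY_} name below is invented.

```c
#include <stddef.h>

typedef void *Toy_Object;	/* stand-in for Lisp_Object */

/* One record per protected variable, chained into a global list that
   a collector could walk to find live objects. */
struct toy_gcpro
{
  struct toy_gcpro *next;
  Toy_Object *var;
};

static struct toy_gcpro *toy_gcprolist;

/* Both macros rely on locals named gcpro1, gcpro2 in the caller. */
#define TOY_GCPRO1(v)						\
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v),		\
   toy_gcprolist = &gcpro1)
#define TOY_GCPRO2(v1, v2)					\
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v1),		\
   gcpro2.next = &gcpro1, gcpro2.var = &(v2),			\
   toy_gcprolist = &gcpro2)
/* Pops everything this function pushed, whichever macro was used. */
#define TOY_UNGCPRO (toy_gcprolist = gcpro1.next)

static int
toy_protected_count (void)
{
  int n = 0;
  struct toy_gcpro *p;
  for (p = toy_gcprolist; p; p = p->next)
    n++;
  return n;
}

static int
toy_work (Toy_Object a, Toy_Object b)
{
  struct toy_gcpro gcpro1, gcpro2;
  int seen;
  TOY_GCPRO2 (a, b);
  seen = toy_protected_count ();	/* a collector would run here */
  TOY_UNGCPRO;
  return seen;
}
```

In this model, using @code{TOY_GCPRO1} twice in one function would overwrite the single @code{gcpro1} record and make it point at itself, which suggests why a separate two-variable macro is needed.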
4534 @cindex caller-protects (@code{GCPRO} rule) | |
4535 Note also that the general rule is @dfn{caller-protects}; i.e. you are | |
4536 only responsible for protecting those Lisp objects that you create. Any | |
4537 objects passed to you as arguments should have been protected by whoever | |
4538 created them, so you don't in general have to protect them. | |
4539 | |
4540 In particular, the arguments to any Lisp primitive are always | |
4541 automatically @code{GCPRO}ed, when called ``normally'' from Lisp code or | |
4542 bytecode. So only a few Lisp primitives that are called frequently from | |
4543 C code, such as @code{Fprogn}, protect their arguments as a service to | |
4544 their caller. You don't need to protect your arguments when writing a | |
4545 new @code{DEFUN}. | |
4546 | |
4547 @code{GCPRO}ing is perhaps the trickiest and most error-prone part of | |
4548 XEmacs coding. It is @strong{extremely} important that you get this | |
4549 right and use a great deal of discipline when writing this code. | |
4550 @xref{GCPROing, ,@code{GCPRO}ing}, for full details on how to do this. | |
4551 | |
4552 What @code{DEFUN} actually does is declare a global structure of type | |
4553 @code{Lisp_Subr} whose name begins with capital @samp{SF} and which | |
4554 contains information about the primitive (e.g. a pointer to the | |
4555 function, its minimum and maximum allowed arguments, a string describing | |
4556 its Lisp name); @code{DEFUN} then begins a normal C function declaration | |
4557 using the @code{F...} name. The Lisp subr object that is the function | |
4558 definition of a primitive (i.e. the object in the function slot of the | |
4559 symbol that names the primitive) actually points to this @samp{SF} | |
4560 structure; when @code{Feval} encounters a subr, it looks in the | |
4561 structure to find out how to call the C function. | |
4562 | |
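The pairing of an @samp{S...} structure with an @samp{F...} function, and the later registration step, can be imitated in a few lines. This is a deliberately simplified sketch: @code{TOY_DEFUN} handles only one-argument functions and takes no docstring or arglist parameter, @code{struct toy_subr} is not the layout of @code{Lisp_Subr}, and the lookup table merely stands in for storing a subr object in a symbol's function cell. All names are hypothetical.

```c
#include <string.h>

typedef long Toy_Object;	/* stand-in for Lisp_Object */

/* Toy analogue of the structure declared for each primitive. */
struct toy_subr
{
  const char *name;		/* Lisp-visible name */
  int min_args, max_args;
  Toy_Object (*fn) (Toy_Object);
};

/* Toy analogue of DEFUN: declares the S... structure, then opens the
   definition of the F... function. */
#define TOY_DEFUN(lname, Fname)						\
  static Toy_Object Fname (Toy_Object);					\
  static const struct toy_subr S##Fname = { lname, 1, 1, Fname };	\
  static Toy_Object Fname

TOY_DEFUN ("add1", Fadd1) (Toy_Object n)
{
  return n + 1;
}

/* Toy analogue of DEFSUBR: make the primitive findable by name. */
static const struct toy_subr *toy_subrs[8];
static int toy_nsubrs;
#define TOY_DEFSUBR(Fname) (toy_subrs[toy_nsubrs++] = &S##Fname)

static const struct toy_subr *
toy_lookup (const char *name)
{
  int i;
  for (i = 0; i < toy_nsubrs; i++)
    if (strcmp (toy_subrs[i]->name, name) == 0)
      return toy_subrs[i];
  return NULL;
}
```

Until @code{TOY_DEFSUBR (Fadd1)} has run, the function exists in C but a lookup by its Lisp-visible name finds nothing.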
4563 Defining the C function is not enough to make a Lisp primitive | |
4564 available; you must also create the Lisp symbol for the primitive (the | |
4565 symbol is @dfn{interned}; @pxref{Obarrays}) and store a suitable subr | |
4566 object in its function cell. (If you don't do this, the primitive won't | |
4567 be seen by Lisp code.) The code looks like this: | |
4568 | |
4569 @example | |
4570 DEFSUBR (@var{fname}); | |
4571 @end example | |
4572 | |
4573 @noindent | |
4574 Here @var{fname} is the same name you used as the second argument to | |
4575 @code{DEFUN}. | |
4576 | |
4577 This call to @code{DEFSUBR} should go in the @code{syms_of_*()} function | |
4578 at the end of the module. If no such function exists, create it and | |
4579 make sure to also declare it in @file{symsinit.h} and call it from the | |
4580 appropriate spot in @code{main()}. @xref{General Coding Rules}. | |
4581 | |
4582 Note that C code cannot call functions by name unless they are defined | |
4583 in C. The way to call a function written in Lisp from C is to use | |
4584 @code{Ffuncall}, which embodies the Lisp function @code{funcall}. Since | |
4585 the Lisp function @code{funcall} accepts an unlimited number of | |
4586 arguments, in C it takes two: the number of Lisp-level arguments, and a | |
4587 one-dimensional array containing their values. The first Lisp-level | |
4588 argument is the Lisp function to call, and the rest are the arguments to | |
4589 pass to it. Since @code{Ffuncall} can call the evaluator, you must | |
4590 protect pointers from garbage collection around the call to | |
4591 @code{Ffuncall}. (However, @code{Ffuncall} explicitly protects all of | |
4592 its parameters, so you don't have to protect any pointers passed as | |
4593 parameters to it.) | |
4594 | |
4595 The C functions @code{call0}, @code{call1}, @code{call2}, and so on, | |
4596 provide a convenient way to call a Lisp function with a fixed | |
4597 number of arguments. They work by calling @code{Ffuncall}. | |
4598 | |
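The argument layout described above (function first, then its arguments, all in one array) and the way fixed-arity helpers can be built on top of it are sketched below. The types and names (@code{Toy_Object}, @code{toy_funcall}, @code{toy_call2}, @code{toy_sum}) are hypothetical; in particular, representing a callable as a bare C function pointer is a simplification and not how Lisp functions are represented.

```c
typedef union toy_object Toy_Object;
typedef long (*toy_fn) (int nargs, Toy_Object *args);

/* Toy stand-in for Lisp_Object: either a callable or a number. */
union toy_object
{
  toy_fn fn;
  long num;
};

/* Same layout as described for Ffuncall: args[0] is the function,
   args[1] onward are the arguments passed to it. */
static long
toy_funcall (int nargs, Toy_Object *args)
{
  return args[0].fn (nargs - 1, args + 1);
}

/* In the spirit of call2: pack a fixed number of arguments into an
   array and delegate to the general entry point. */
static long
toy_call2 (toy_fn fn, long a, long b)
{
  Toy_Object args[3];
  args[0].fn = fn;
  args[1].num = a;
  args[2].num = b;
  return toy_funcall (3, args);
}

/* A sample callee. */
static long
toy_sum (int nargs, Toy_Object *args)
{
  long sum = 0;
  int i;
  for (i = 0; i < nargs; i++)
    sum += args[i].num;
  return sum;
}
```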
4599 @file{eval.c} is a very good file to look through for examples; | |
4600 @file{lisp.h} contains the definitions for important macros and | |
4601 functions. | |
4602 | |
4603 @node Writing Good Comments, Adding Global Lisp Variables, Writing Lisp Primitives, Rules When Writing New C Code | |
4604 @section Writing Good Comments | |
4605 @cindex writing good comments | |
4606 @cindex comments, writing good | |
4607 | |
4608 Comments are a lifeline for programmers trying to understand tricky | |
4609 code. In general, the less obvious it is what you are doing, the more | |
4610 you need a comment, and the more detailed it needs to be. You should | |
4611 always be on guard when you're writing code for stuff that's tricky, and | |
4612 should constantly be putting yourself in someone else's shoes and asking | |
4613 if that person could figure out without much difficulty what's going | |
4614 on. (Assume they are a competent programmer who understands the | |
4615 essentials of how the XEmacs code is structured but doesn't know much | |
4616 about the module you're working on or any algorithms you're using.) If | |
4617 you're not sure whether they would be able to, add a comment. Always | |
4618 err on the side of more comments, rather than fewer. | |
4619 | |
4620 Generally, when making comments, there is no need to attribute them with | |
4621 your name or initials. This especially goes for small, | |
4622 easy-to-understand, non-opinionated ones. Also, comments indicating | |
4623 where, when, and by whom a file was changed are @emph{strongly} | |
4624 discouraged, and in general will be removed as they are discovered. | |
4625 This is exactly what @file{ChangeLog}s are there for. However, it can | |
4626 occasionally be useful to mark exactly where (but not when or by whom) | |
4627 changes are made, particularly when making small changes to a file | |
4628 imported from elsewhere. These marks help when later on a newer version | |
4629 of the file is imported and the changes need to be merged. (If | |
4630 everything were always kept in CVS, there would be no need for this. | |
4631 But in practice, this often doesn't happen, or the CVS repository is | |
4632 later on lost or unavailable to the person doing the update.) | |
4633 | |
4634 When putting in an explicit opinion in a comment, you should | |
4635 @emph{always} attribute it with your name and the date. This also goes | |
4636 for long, complex comments explaining in detail the workings of | |
4637 something -- by putting your name there, you make it possible for | |
4638 someone who has questions about how that thing works to determine who | |
4639 wrote the comment so they can write to them. Use your actual name or | |
4640 your alias at xemacs.org, and not your initials or nickname, unless that | |
4641 is generally recognized (e.g. @samp{jwz}). Even then, please consider | |
4642 requesting a virtual user at xemacs.org (forwarding address; we can't | |
4643 provide an actual mailbox). Otherwise, give first and last name. If | |
4644 you're not a regular contributor, you might consider putting your email | |
address in -- it may be in the ChangeLog, but after a while ChangeLogs
tend to disappear or get muddled. (E.g. your comment
4647 may get copied somewhere else or even into another program, and tracking | |
4648 down the proper ChangeLog may be very difficult.) | |
4649 | |
4650 If you come across an opinion that is not or is no longer valid, or you | |
4651 come across any comment that no longer applies but you want to keep it | |
4652 around, enclose it in @samp{[[ } and @samp{ ]]} marks and add a comment | |
4653 afterwards explaining why the preceding comment is no longer valid. Put | |
4654 your name on this comment, as explained above. | |
4655 | |
4656 Just as comments are a lifeline to programmers, incorrect comments are | |
4657 death. If you come across an incorrect comment, @strong{immediately} | |
4658 correct it or flag it as incorrect, as described in the previous | |
4659 paragraph. Whenever you work on a section of code, @emph{always} make | |
4660 sure to update any comments to be correct -- or, at the very least, flag | |
4661 them as incorrect. | |
4662 | |
4663 To indicate a "todo" or other problem, use four pound signs -- | |
4664 i.e. @samp{####}. | |
4665 | |
4666 @node Adding Global Lisp Variables, Writing Macros, Writing Good Comments, Rules When Writing New C Code | |
4667 @section Adding Global Lisp Variables | |
4668 @cindex global Lisp variables, adding | |
4669 @cindex variables, adding global Lisp | |
4670 | |
4671 Global variables whose names begin with @samp{Q} are constants whose | |
4672 value is a symbol of a particular name. The name of the variable should | |
4673 be derived from the name of the symbol using the same rules as for Lisp | |
4674 primitives. These variables are initialized using a call to | |
4675 @code{defsymbol()} in the @code{syms_of_*()} function. (This call | |
4676 interns a symbol, sets the C variable to the resulting Lisp object, and | |
4677 calls @code{staticpro()} on the C variable to tell the | |
4678 garbage-collection mechanism about this variable. What | |
4679 @code{staticpro()} does is add a pointer to the variable to a large | |
4680 global array; when garbage-collection happens, all pointers listed in | |
4681 the array are used as starting points for marking Lisp objects. This is | |
4682 important because it's quite possible that the only current reference to | |
4683 the object is the C variable. In the case of symbols, the | |
4684 @code{staticpro()} doesn't matter all that much because the symbol is | |
4685 contained in @code{obarray}, which is itself @code{staticpro()}ed. | |
4686 However, it's possible that a naughty user could do something like | |
4687 uninterning the symbol out of @code{obarray} or even setting | |
4688 @code{obarray} to a different value [although this is likely to make | |
4689 XEmacs crash!].) | |
4690 | |
4691 @strong{Please note:} It is potentially deadly if you declare a | |
4692 @samp{Q...} variable in two different modules. The two calls to | |
4693 @code{defsymbol()} are no problem, but some linkers will complain about | |
4694 multiply-defined symbols. The most insidious aspect of this is that | |
4695 often the link will succeed anyway, but then the resulting executable | |
4696 will sometimes crash in obscure ways during certain operations! | |
4697 | |
4698 To avoid this problem, declare any symbols with common names (such as | |
4699 @code{text}) that are not obviously associated with this particular | |
4700 module in the file @file{general-slots.h}. The ``-slots'' suffix | |
4701 indicates that this is a file that is included multiple times in | |
4702 @file{general.c}. Redefinition of preprocessor macros allows the | |
4703 effects to be different in each context, so this is actually more | |
4704 convenient and less error-prone than doing it in your module. | |
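
The multiple-inclusion technique can be sketched in a single file as
follows. This is only an illustration, not the actual contents of
@file{general-slots.h} or @file{general.c}: a list macro stands in for
the repeatedly included header, plain strings stand in for Lisp symbols,
and every @code{DEMO_}/@code{demo_} name is hypothetical.

```c
#include <assert.h>

/* Stand-in for a "-slots" file: one entry per shared symbol.  In a
   multi-file layout this list would live in its own header and be
   #included once per expansion; a list macro keeps the sketch in a
   single file.  All DEMO_/demo_ names are hypothetical. */
#define DEMO_SLOTS(SYMBOL) \
  SYMBOL (Qtext)           \
  SYMBOL (Qcolor)          \
  SYMBOL (Qsize)

/* Expansion 1: define exactly one C variable per symbol. */
#define DEMO_DEFINE(name) static const char *name;
DEMO_SLOTS (DEMO_DEFINE)
#undef DEMO_DEFINE

/* Expansion 2: initialize every variable from its own name,
   dropping the leading Q. */
static void
demo_init_slots (void)
{
#define DEMO_INIT(name) name = &#name[1];
  DEMO_SLOTS (DEMO_INIT)
#undef DEMO_INIT
  assert (Qtext && Qcolor && Qsize);
}

/* Expansion 3: count the entries. */
static int
demo_slot_count (void)
{
#define DEMO_COUNT(name) + 1
  return 0 DEMO_SLOTS (DEMO_COUNT);
#undef DEMO_COUNT
}
```

Because each symbol is listed exactly once, the definition and the
initialization cannot drift apart, and no second module can define the
same variable by accident.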
4705 | |
4706 Global variables whose names begin with @samp{V} are variables that | |
4707 contain Lisp objects. The convention here is that all global variables | |
4708 of type @code{Lisp_Object} begin with @samp{V}, and all others don't | |
4709 (including integer and boolean variables that have Lisp | |
4710 equivalents). Most of the time, these variables have equivalents in | |
4711 Lisp, but some don't. Those that do are declared this way by a call to | |
4712 @code{DEFVAR_LISP()} in the @code{vars_of_*()} initializer for the | |
4713 module. What this does is create a special @dfn{symbol-value-forward} | |
4714 Lisp object that contains a pointer to the C variable, intern a symbol | |
4715 whose name is as specified in the call to @code{DEFVAR_LISP()}, and set | |
4716 its value to the symbol-value-forward Lisp object; it also calls | |
4717 @code{staticpro()} on the C variable to tell the garbage-collection | |
4718 mechanism about the variable. When @code{eval} (or actually | |
4719 @code{symbol-value}) encounters this special object in the process of | |
4720 retrieving a variable's value, it follows the indirection to the C | |
4721 variable and gets its value. @code{setq} does similar things so that | |
4722 the C variable gets changed. | |
4723 | |
4724 Whether or not you @code{DEFVAR_LISP()} a variable, you need to | |
4725 initialize it in the @code{vars_of_*()} function; otherwise it will end | |
4726 up as all zeroes, which is the integer 0 (@emph{not} @code{nil}), and | |
4727 this is probably not what you want. Also, if the variable is not | |
4728 @code{DEFVAR_LISP()}ed, @strong{you must call} @code{staticpro()} on the | |
4729 C variable in the @code{vars_of_*()} function. Otherwise, the | |
4730 garbage-collection mechanism won't know that the object in this variable | |
4731 is in use, and will happily collect it and reuse its storage for another | |
4732 Lisp object, and you will be the one who's unhappy when you can't figure | |
4733 out how your variable got overwritten. | |
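
The role of @code{staticpro()} described in this section can be
pictured with a toy model. The sketch below is not the XEmacs
implementation: the object type, the fixed-size root array, and all
@code{demo_} names are assumptions made for illustration. It shows only
the essential idea, which is that the collector records the address of
the C variable and later uses whatever object the variable holds as a
starting point for marking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real XEmacs definitions. */
typedef struct demo_object { int marked; } demo_object;

#define DEMO_MAX_ROOTS 16
static demo_object **demo_roots[DEMO_MAX_ROOTS];
static int demo_root_count;

/* Record the ADDRESS of a C variable, not its current value, so that
   later assignments to the variable are still seen by the collector. */
static void
demo_staticpro (demo_object **varaddress)
{
  assert (demo_root_count < DEMO_MAX_ROOTS);
  demo_roots[demo_root_count++] = varaddress;
}

/* Mark phase: every registered variable is a starting point. */
static void
demo_mark_roots (void)
{
  int i;
  for (i = 0; i < demo_root_count; i++)
    if (*demo_roots[i] != NULL)
      (*demo_roots[i])->marked = 1;
}

static demo_object *Vdemo_protected;    /* registered below */
static demo_object *Vdemo_forgotten;    /* never registered */

/* Plays the part of a vars_of_*() initializer. */
static void
demo_vars_of_module (void)
{
  Vdemo_protected = NULL;
  Vdemo_forgotten = NULL;
  demo_staticpro (&Vdemo_protected);
}
```

After @code{demo_vars_of_module()} runs, an object stored in
@code{Vdemo_protected} is marked by @code{demo_mark_roots()}, while an
object referenced only from @code{Vdemo_forgotten} stays unmarked and
would be reclaimed.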
4734 | |
4735 @node Writing Macros, Proper Use of Unsigned Types, Adding Global Lisp Variables, Rules When Writing New C Code | |
4736 @section Writing Macros | |
4737 @cindex writing macros | |
4738 @cindex macros, writing | |
4739 | |
4740 The three golden rules of macros: | |
4741 | |
4742 @enumerate | |
4743 @item | |
4744 Anything that's an lvalue can be evaluated more than once. | |
4745 @item | |
4746 Macros where anything else can be evaluated more than once should | |
4747 have the word "unsafe" in their name (exceptions may be made for | |
4748 large sets of macros that evaluate arguments of certain types more | |
4749 than once, e.g. struct buffer * arguments, when clearly indicated in | |
4750 the macro documentation). These macros are generally meant to be | |
4751 called only by other macros that have already stored the calling | |
4752 values in temporary variables. | |
4753 @item | |
4754 Nothing else can be evaluated more than once. Use inline | |
4755 functions, if necessary, to prevent multiple evaluation. | |
4756 @end enumerate | |
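
A minimal sketch of why rules 2 and 3 matter, using hypothetical names
that do not appear in the XEmacs sources: a macro that mentions its
argument twice runs the argument's side effects twice, while an inline
function runs them once.

```c
/* Evaluates its argument twice, so under rule 2 it carries "unsafe"
   in its name.  All names here are hypothetical. */
#define DEMO_SQUARE_UNSAFE(x) ((x) * (x))

/* An inline function evaluates its argument exactly once (rule 3). */
static inline int
demo_square (int x)
{
  return x * x;
}

static int demo_calls;   /* how often demo_next() has run */

/* An argument expression with a side effect. */
static int
demo_next (void)
{
  return ++demo_calls;
}

/* Number of times the argument is evaluated through the macro. */
static int
demo_evaluations_via_macro (void)
{
  demo_calls = 0;
  (void) DEMO_SQUARE_UNSAFE (demo_next ());
  return demo_calls;
}

/* Number of times the argument is evaluated through the function. */
static int
demo_evaluations_via_function (void)
{
  demo_calls = 0;
  (void) demo_square (demo_next ());
  return demo_calls;
}
```

Here @code{demo_evaluations_via_macro()} returns 2 and
@code{demo_evaluations_via_function()} returns 1.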
4757 | |
4758 NOTE: The functions and macros below are given full prototypes in their | |
4759 docs, even when the implementation is a macro. In such cases, passing | |
4760 an argument of a type other than expected will produce undefined | |
4761 results. Also, given that macros can do things functions can't (in | |
4762 particular, directly modify arguments as if they were passed by | |
4763 reference), the declaration syntax has been extended to include the | |
4764 call-by-reference syntax from C++, where an & after a type indicates | |
4765 that the argument is an lvalue and is passed by reference, i.e. the | |
4766 function can modify its value. (This is equivalent in C to passing a | |
4767 pointer to the argument, but without the need to explicitly worry about | |
4768 pointers.) | |
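
As an illustration of this documentation convention, consider a
hypothetical macro (not part of XEmacs) that would be documented with
the prototype @code{void DEMO_INCREMENT_CLAMPED (int &var, int limit)}.
The sketch below shows one possible macro implementation next to the
plain C function that achieves the same effect through an explicit
pointer.

```c
#include <assert.h>

/* Documented as:  void DEMO_INCREMENT_CLAMPED (int &var, int limit)
   `var' is an lvalue, so it may be evaluated more than once; `limit'
   is copied into a local so that it is evaluated exactly once. */
#define DEMO_INCREMENT_CLAMPED(var, limit) do {   \
  int dic_limit = (limit);                        \
  if ((var) < dic_limit)                          \
    (var)++;                                      \
} while (0)

/* Plain C equivalent: the caller must pass a pointer explicitly. */
static void
demo_increment_clamped (int *var, int limit)
{
  assert (var != 0);
  if (*var < limit)
    (*var)++;
}
```

The macro is called as @code{DEMO_INCREMENT_CLAMPED (n, 2)}, the
function as @code{demo_increment_clamped (&n, 2)}; both leave @code{n}
unchanged once it has reached the limit.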
4769 | |
4770 When to capitalize macros: | |
4771 | |
4772 @itemize @bullet | |
4773 @item | |
4774 Capitalize macros doing stuff obviously impossible with (C) | |
4775 functions, e.g. directly modifying arguments as if they were passed by | |
4776 reference. | |
4777 @item | |
4778 Capitalize macros that evaluate @strong{any} argument more than once regardless | |
4779 of whether that's "allowed" (e.g. buffer arguments). | |
4780 @item | |
Capitalize macros that directly access a field in a @code{Lisp_Object} or
its equivalent underlying structure. In such cases, the macro that takes a
@code{Lisp_Object} has its name prefixed with an @samp{X}, and the macro
that takes the underlying structure does not.
4785 @item | |
4786 Capitalize certain other basic macros relating to Lisp_Objects; e.g. | |
4787 FRAMEP, CHECK_FRAME, etc. | |
4788 @item | |
4789 Try to avoid capitalizing any other macros. | |
4790 @end itemize | |
4791 | |
4792 @node Proper Use of Unsigned Types, Techniques for XEmacs Developers, Writing Macros, Rules When Writing New C Code | |
4793 @section Proper Use of Unsigned Types | |
4794 @cindex unsigned types, proper use of | |
4795 @cindex types, proper use of unsigned | |
4796 | |
4797 Avoid using @code{unsigned int} and @code{unsigned long} whenever | |
possible. Unsigned types are viral -- in any arithmetic or comparison
that mixes a signed and an unsigned type of the same width, the signed
operand is automatically converted to unsigned, which is almost
certainly not what you want. Many subtle and
4801 hard-to-find bugs are created by careless use of unsigned types. In | |
4802 general, you should almost @emph{never} use an unsigned type to hold a | |
4803 regular quantity of any sort. The only exceptions are | |
4804 | |
4805 @enumerate | |
4806 @item | |
4807 When there's a reasonable possibility you will actually need all 32 or | |
4808 64 bits to store the quantity. | |
4809 @item | |
When calling existing APIs that require unsigned types. In this case,
4811 you should still do all manipulation using signed types, and do the | |
4812 conversion at the very threshold of the API call. | |
4813 @item | |
4814 In existing code that you don't want to modify because you don't | |
4815 maintain it. | |
4816 @item | |
4817 In bit-field structures. | |
4818 @end enumerate | |
4819 | |
4820 Other reasonable uses of @code{unsigned int} and @code{unsigned long} | |
4821 are representing non-quantities -- e.g. bit-oriented flags and such. | |
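
The following sketch shows two typical ways in which unsigned
quantities go wrong. The function names are hypothetical, and the
behavior shown assumes the usual case where @code{int} and
@code{unsigned int} have the same width.

```c
#include <assert.h>

/* Mixed comparison: `a' is converted to unsigned int before the
   comparison, so -1 becomes UINT_MAX. */
static int
demo_less_mixed (int a, unsigned int b)
{
  return a < b;
}

/* All-signed comparison behaves as the arithmetic suggests. */
static int
demo_less_signed (int a, int b)
{
  return a < b;
}

/* Buggy: for len == 0, `len - 1' wraps around to UINT_MAX, so every
   index appears to be in range. */
static int
demo_before_last_unsigned (unsigned int idx, unsigned int len)
{
  return idx < len - 1;
}

/* Signed version; a nonsensical negative length can also be caught. */
static int
demo_before_last_signed (int idx, int len)
{
  assert (len >= 0);
  return idx < len - 1;
}
```

With these definitions @code{demo_less_mixed (-1, 1u)} is false
although @code{demo_less_signed (-1, 1)} is true, and
@code{demo_before_last_unsigned (5, 0)} is true although
@code{demo_before_last_signed (5, 0)} is false.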
4822 | |
4823 @node Techniques for XEmacs Developers, , Proper Use of Unsigned Types, Rules When Writing New C Code | |
4824 @section Techniques for XEmacs Developers | |
4825 @cindex techniques for XEmacs developers | |
4826 @cindex developers, techniques for XEmacs | |
4827 | |
4828 @cindex Purify | |
4829 @cindex Quantify | |
4830 To make a purified XEmacs, do: @code{make puremacs}. | |
4831 To make a quantified XEmacs, do: @code{make quantmacs}. | |
4832 | |
4833 You simply can't dump Quantified and Purified images (unless using the | |
4834 portable dumper). Purify gets confused when xemacs frees memory in one | |
4835 process that was allocated in a @emph{different} process on a different | |
4836 machine! Run it like so: | |
4837 @example | |
4838 temacs -batch -l loadup.el run-temacs @var{xemacs-args...} | |
4839 @end example | |
4840 | |
4841 @cindex error checking | |
4842 Before you go through the trouble, are you compiling with all | |
4843 debugging and error-checking off? If not, try that first. Be warned | |
4844 that while Quantify is directly responsible for quite a few | |
4845 optimizations which have been made to XEmacs, doing a run which | |
4846 generates results which can be acted upon is not necessarily a trivial | |
4847 task. | |
4848 | |
4849 Also, if you're still willing to do some runs make sure you configure | |
4850 with the @samp{--quantify} flag. That will keep Quantify from starting | |
4851 to record data until after the loadup is completed and will shut off | |
4852 recording right before it shuts down (which generates enough bogus data | |
4853 to throw most results off). It also enables three additional elisp | |
4854 commands: @code{quantify-start-recording-data}, | |
4855 @code{quantify-stop-recording-data} and @code{quantify-clear-data}. | |
4856 | |
4857 If you want to make XEmacs faster, target your favorite slow benchmark, | |
4858 run a profiler like Quantify, @code{gprof}, or @code{tcov}, and figure | |
4859 out where the cycles are going. In many cases you can localize the | |
4860 problem (because a particular new feature or even a single patch | |
4861 elicited it). Don't hesitate to use brute force techniques like a | |
4862 global counter incremented at strategic places, especially in | |
4863 combination with other performance indications (@emph{e.g.}, degree of | |
4864 buffer fragmentation into extents). | |
4865 | |
4866 Specific projects: | |
4867 | |
4868 @itemize @bullet | |
4869 @item | |
4870 Make the garbage collector faster. Figure out how to write an | |
4871 incremental garbage collector. | |
4872 @item | |
4873 Write a compiler that takes bytecode and spits out C code. | |
4874 Unfortunately, you will then need a C compiler and a more fully | |
4875 developed module system. | |
4876 @item | |
4877 Speed up redisplay. | |
4878 @item | |
4879 Speed up syntax highlighting. It was suggested that ``maybe moving some | |
4880 of the syntax highlighting capabilities into C would make a | |
4881 difference.'' Wrong idea, I think. When processing one 400kB file a | |
4882 particular low-level routine was being called 40 @emph{million} times | |
4883 simply for @emph{one} call to @code{newline-and-indent}. Syntax | |
4884 highlighting needs to be rewritten to use a reliable, fast parser, then | |
4885 to trust the pre-parsed structure, and only do re-highlighting locally | |
4886 to a text change. Modern machines are fast enough to implement such | |
4887 parsers in Lisp; but no machine will ever be fast enough to deal with | |
4888 quadratic (or worse) algorithms! | |
4889 @item | |
4890 Implement tail recursion in Emacs Lisp (hard!). | |
4891 @end itemize | |
4892 | |
4893 Unfortunately, Emacs Lisp is slow, and is going to stay slow. Function | |
4894 calls in elisp are especially expensive. Iterating over a long list is | |
4895 going to be 30 times faster implemented in C than in Elisp. | |
4896 | |
Heavily used small code fragments need to be fast. The traditional way
to implement such code fragments in C is with macros. But macros in C
have two well-known problems.
4900 | |
4901 @cindex macro hygiene | |
4902 Macro arguments that are repeatedly evaluated may suffer from repeated | |
4903 side effects or suboptimal performance. | |
4904 | |
4905 Variable names used in macros may collide with caller's variables, | |
4906 causing (at least) unwanted compiler warnings. | |
4907 | |
4908 In order to solve these problems, and maintain statement semantics, one | |
4909 should use the @code{do @{ ... @} while (0)} trick while trying to | |
4910 reference macro arguments exactly once using local variables. | |
4911 | |
4912 Let's take a look at this poor macro definition: | |
4913 | |
4914 @example | |
4915 #define MARK_OBJECT(obj) \ | |
4916 if (!marked_p (obj)) mark_object (obj), did_mark = 1 | |
4917 @end example | |
4918 | |
This macro evaluates its argument twice. It also silently misbehaves when
used like this, because the @code{else} binds to the @code{if} hidden
inside the macro instead of to @code{if (flag)}:
@example
if (flag) MARK_OBJECT (obj); else do_something ();
4922 @end example | |
4923 | |
4924 A much better definition is | |
4925 | |
4926 @example | |
4927 #define MARK_OBJECT(obj) do @{ \ | |
4928 Lisp_Object mo_obj = (obj); \ | |
4929 if (!marked_p (mo_obj)) \ | |
4930 @{ \ | |
4931 mark_object (mo_obj); \ | |
4932 did_mark = 1; \ | |
4933 @} \ | |
4934 @} while (0) | |
4935 @end example | |
4936 | |
4937 Notice the elimination of double evaluation by using the local variable | |
4938 with the obscure name. Writing safe and efficient macros requires great | |
care. The one problem with macros that cannot be portably worked around
is that, because a C block has no value, a macro used as an expression
rather than a statement cannot use the techniques just described to avoid
multiple evaluation.
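
The difference between the two styles of definition can be observed
directly. The sketch below is a self-contained toy, not XEmacs code: a
one-field structure replaces @code{Lisp_Object}, and all
@code{DEMO_}/@code{demo_} names are hypothetical. Each helper reports
whether the caller's @code{else} branch ran.

```c
typedef struct demo_object { int marked; } demo_object;

static int demo_did_mark;

/* Poor: a bare `if' that can capture the caller's `else'. */
#define DEMO_MARK_POOR(obj) \
  if (!(obj)->marked) (obj)->marked = 1, demo_did_mark = 1

/* Better: a single statement that evaluates `obj' exactly once. */
#define DEMO_MARK(obj) do {                 \
  demo_object *dm_obj = (obj);              \
  if (!dm_obj->marked)                      \
    {                                       \
      dm_obj->marked = 1;                   \
      demo_did_mark = 1;                    \
    }                                       \
} while (0)

/* The `else' here silently attaches to the `if' inside the macro. */
static int
demo_else_taken_poor (int flag, demo_object *obj)
{
  int else_taken = 0;
  if (flag) DEMO_MARK_POOR (obj); else else_taken = 1;
  return else_taken;
}

/* The `else' here belongs to `if (flag)', as the caller intended. */
static int
demo_else_taken_good (int flag, demo_object *obj)
{
  int else_taken = 0;
  if (flag) DEMO_MARK (obj); else else_taken = 1;
  return else_taken;
}
```

With @code{flag} equal to 0, @code{demo_else_taken_good()} returns 1 as
intended, while @code{demo_else_taken_poor()} returns 0 because the
@code{else} was captured by the macro. Some compilers warn about the
ambiguous @code{else} in the poor version, but the code still compiles.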
4943 | |
4944 @cindex inline functions | |
4945 In most cases where a macro has function semantics, an inline function | |
4946 is a better implementation technique. Modern compiler optimizers tend | |
4947 to inline functions even if they have no @code{inline} keyword, and | |
4948 configure magic ensures that the @code{inline} keyword can be safely | |
used as an additional compiler hint. Inline functions used in a single
@file{.c} file are easy. The function must already be defined to be
@code{static}. Just add the @code{inline} keyword to the
definition.
4953 | |
4954 @example | |
4955 inline static int | |
4956 heavily_used_small_function (int arg) | |
4957 @{ | |
4958 ... | |
4959 @} | |
4960 @end example | |
4961 | |
4962 Inline functions in header files are trickier, because we would like to | |
4963 make the following optimization if the function is @emph{not} inlined | |
4964 (for example, because we're compiling for debugging). We would like the | |
4965 function to be defined externally exactly once, and each calling | |
4966 translation unit would create an external reference to the function, | |
4967 instead of including a definition of the inline function in the object | |
4968 code of every translation unit that uses it. This optimization is | |
4969 currently only available for gcc. But you don't have to worry about the | |
4970 trickiness; just define your inline functions in header files using this | |
4971 pattern: | |
4972 | |
4973 @example | |
4974 DECLARE_INLINE_HEADER ( | |
4975 int | |
4976 i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg) | |
4977 ) | |
4978 @{ | |
4979 ... | |
4980 @} | |
4981 @end example | |
4982 | |
4983 We use @code{DECLARE_INLINE_HEADER} rather than just the modifier | |
4984 @code{INLINE_HEADER} to prevent warnings when compiling with @code{gcc | |
4985 -Wmissing-declarations}. I consider issuing this warning for inline | |
4986 functions a gcc bug, but the gcc maintainers disagree. | |
4987 | |
4988 @cindex inline functions, headers | |
4989 @cindex header files, inline functions | |
4990 Every header which contains inline functions, either directly by using | |
4991 @code{DECLARE_INLINE_HEADER} or indirectly by using @code{DECLARE_LRECORD} must | |
4992 be added to @file{inline.c}'s includes to make the optimization | |
4993 described above work. (Optimization note: if all INLINE_HEADER | |
4994 functions are in fact inlined in all translation units, then the linker | |
4995 can just discard @code{inline.o}, since it contains only unreferenced code). | |
4996 | |
4997 To get started debugging XEmacs, take a look at the @file{.gdbinit} and | |
4998 @file{.dbxrc} files in the @file{src} directory. See the section in the | |
4999 XEmacs FAQ on How to Debug an XEmacs problem with a debugger. | |
5000 | |
5001 After making source code changes, run @code{make check} to ensure that | |
you haven't introduced any regressions. If you want to make XEmacs more
5003 reliable, please improve the test suite in @file{tests/automated}. | |
5004 | |
5005 Did you make sure you didn't introduce any new compiler warnings? | |
5006 | |
5007 Before submitting a patch, please try compiling at least once with | |
5008 | |
5009 @example | |
5010 configure --with-mule --use-union-type --error-checking=all | |
5011 @end example | |
5012 | |
5013 Here are things to know when you create a new source file: | |
5014 | |
5015 @itemize @bullet | |
5016 @item | |
5017 All @file{.c} files should @code{#include <config.h>} first. Almost all | |
5018 @file{.c} files should @code{#include "lisp.h"} second. | |
5019 | |
5020 @item | |
5021 Generated header files should be included using the @samp{#include <...>} | |
5022 syntax, not the @samp{#include "..."} syntax. The generated headers are: | |
5023 | |
5024 @file{config.h sheap-adjust.h paths.h Emacs.ad.h} | |
5025 | |
The basic rule is that you should assume builds using @samp{--srcdir},
and that the @samp{#include <...>} syntax must be used when the
5028 to-be-included generated file is in a potentially different directory | |
5029 @emph{at compile time}. The non-obvious C rule is that | |
5030 @samp{#include "..."} means to search for the included file in the same | |
5031 directory as the including file, @emph{not} in the current directory. | |
5032 Normally this is not a problem but when building with @samp{--srcdir}, | |
5033 @file{make} will search the @samp{VPATH} for you, while the C compiler | |
5034 knows nothing about it. | |
5035 | |
5036 @item | |
Header files should @emph{not} include @samp{<config.h>} or
@samp{"lisp.h"}. It is the responsibility of the @file{.c} files that
use them to do so.
5040 | |
5041 @end itemize | |
5042 | |
5043 @cindex Lisp object types, creating | |
5044 @cindex creating Lisp object types | |
5045 @cindex object types, creating Lisp | |
5046 Here is a checklist of things to do when creating a new lisp object type | |
5047 named @var{foo}: | |
5048 | |
5049 @enumerate | |
5050 @item | |
5051 create @var{foo}.h | |
5052 @item | |
5053 create @var{foo}.c | |
5054 @item | |
5055 add definitions of @code{syms_of_@var{foo}}, etc. to @file{@var{foo}.c} | |
5056 @item | |
5057 add declarations of @code{syms_of_@var{foo}}, etc. to @file{symsinit.h} | |
5058 @item | |
5059 add calls to @code{syms_of_@var{foo}}, etc. to @file{emacs.c} | |
5060 @item | |
5061 add definitions of macros like @code{CHECK_@var{FOO}} and | |
5062 @code{@var{FOO}P} to @file{@var{foo}.h} | |
5063 @item | |
5064 add the new type index to @code{enum lrecord_type} | |
5065 @item | |
add a @code{DEFINE_LRECORD_IMPLEMENTATION} call to @file{@var{foo}.c}
@item
add an @code{INIT_LRECORD_IMPLEMENTATION} call to @code{syms_of_@var{foo}}
in @file{@var{foo}.c}
5069 @end enumerate | |
5070 | |
5071 @node Regression Testing XEmacs, CVS Techniques, Rules When Writing New C Code, Top | |
5072 @chapter Regression Testing XEmacs | |
5073 @cindex testing, regression | |
5074 | |
5075 @menu | |
5076 * How to Regression-Test:: | |
5077 * Modules for Regression Testing:: | |
5078 @end menu | |
5079 | |
5080 @node How to Regression-Test, Modules for Regression Testing, Regression Testing XEmacs, Regression Testing XEmacs | |
5081 @section How to Regression-Test | |
5082 @cindex how to regression-test | |
5083 @cindex regression-test, how to | |
5084 @cindex testing, regression, how to | |
5085 | |
5086 The source directory @file{tests/automated} contains XEmacs' automated | |
5087 test suite. The usual way of running all the tests is running | |
5088 @code{make check} from the top-level build directory. | |
5089 | |
5090 The test suite is unfinished and it's still lacking some essential | |
5091 features. It is nevertheless recommended that you run the tests to | |
5092 confirm that XEmacs behaves correctly. | |
5093 | |
5094 If you want to run a specific test case, you can do it from the | |
5095 command-line like this: | |
5096 | |
5097 @example | |
5098 $ xemacs -batch -l test-harness.elc -f batch-test-emacs TEST-FILE | |
5099 @end example | |
5100 | |
5101 If a test fails and you need more information, you can run the test | |
5102 suite interactively by loading @file{test-harness.el} into a running | |
5103 XEmacs and typing @kbd{M-x test-emacs-test-file RET <filename> RET}. | |
5104 You will see a log of passed and failed tests, which should allow you to | |
5105 investigate the source of the error and ultimately fix the bug. If you | |
5106 are not capable of, or don't have time for, debugging it yourself, | |
5107 please do report the failures using @kbd{M-x report-emacs-bug} or | |
5108 @kbd{M-x build-report}. | |
5109 | |
5110 @deffn Command test-emacs-test-file file | |
5111 Runs the tests in @var{file}. @file{test-harness.el} must be loaded. | |
5112 Defines all the macros described in this node, and undefines them when | |
5113 done. | |
5114 @end deffn | |
5115 | |
Adding a new test file is trivial: just create a new file in
@file{tests/automated} and it will be run. There is no need to
byte-compile any of the files in that directory---the test-harness will
take care of any necessary byte-compilation.
5120 | |
Look at the existing test cases for examples of how to write tests.
5122 It all boils down to your imagination and judicious use of the macros | |
5123 @code{Assert}, @code{Check-Error}, @code{Check-Error-Message}, and | |
5124 @code{Check-Message}. Note that all of these macros are defined only | |
5125 for the duration of the test: they do not exist in the global | |
5126 environment. | |
5127 | |
5128 @deffn Macro Assert expr | |
5129 Check that @var{expr} is non-nil at this point in the test. | |
5130 @end deffn | |
5131 | |
5132 @deffn Macro Check-Error expected-error body | |
5133 Check that execution of @var{body} causes @var{expected-error} to be | |
5134 signaled. @var{body} is a @code{progn}-like body, and may contain | |
5135 several expressions. @var{expected-error} is a symbol defined as | |
5136 an error by @code{define-error}. | |
5137 @end deffn | |
5138 | |
5139 @deffn Macro Check-Error-Message expected-error expected-error-regexp body | |
5140 Check that execution of @var{body} causes @var{expected-error} to be | |
5141 signaled, and generate a message matching @var{expected-error-regexp}. | |
5142 @var{body} is a @code{progn}-like body, and may contain several | |
5143 expressions. @var{expected-error} is a symbol defined as an error | |
5144 by @code{define-error}. | |
5145 @end deffn | |
5146 | |
5147 @deffn Macro Check-Message expected-message body | |
5148 Check that execution of @var{body} causes @var{expected-message} to be | |
5149 generated (using @code{message} or a similar function). @var{body} is a | |
5150 @code{progn}-like body, and may contain several expressions. | |
5151 @end deffn | |
5152 | |
5153 Here's a simple example checking case-sensitive and case-insensitive | |
5154 comparisons from @file{case-tests.el}. | |
5155 | |
5156 @example | |
5157 (with-temp-buffer | |
5158 (insert "Test Buffer") | |
5159 (let ((case-fold-search t)) | |
5160 (goto-char (point-min)) | |
5161 (Assert (eq (search-forward "test buffer" nil t) 12)) | |
5162 (goto-char (point-min)) | |
5163 (Assert (eq (search-forward "Test buffer" nil t) 12)) | |
5164 (goto-char (point-min)) | |
5165 (Assert (eq (search-forward "Test Buffer" nil t) 12)) | |
5166 | |
5167 (setq case-fold-search nil) | |
5168 (goto-char (point-min)) | |
5169 (Assert (not (search-forward "test buffer" nil t))) | |
5170 (goto-char (point-min)) | |
5171 (Assert (not (search-forward "Test buffer" nil t))) | |
5172 (goto-char (point-min)) | |
5173 (Assert (eq (search-forward "Test Buffer" nil t) 12)))) | |
5174 @end example | |
5175 | |
5176 This example could be saved in a file in @file{tests/automated}, and it | |
5177 would constitute a complete test, automatically executed when you run | |
5178 @kbd{make check} after building XEmacs. More complex tests may require | |
5179 substantial temporary scaffolding to create the environment that elicits | |
5180 the bugs, but the top-level @file{Makefile} and @file{test-harness.el} | |
5181 handle the running and collection of results from the @code{Assert}, | |
5182 @code{Check-Error}, @code{Check-Error-Message}, and @code{Check-Message} | |
5183 macros. | |
5184 | |
5185 Don't suppress tests just because they're due to known bugs not yet | |
5186 fixed---use the @code{Known-Bug-Expect-Failure} wrapper macro to mark | |
5187 them. | |
5188 | |
5189 @deffn Macro Known-Bug-Expect-Failure body | |
5190 Arrange for failing tests in @var{body} to generate messages prefixed | |
5191 with "KNOWN BUG:" instead of "FAIL:". @var{body} is a @code{progn}-like | |
5192 body, and may contain several tests. | |
5193 @end deffn | |
5194 | |
5195 A lot of the tests we run push limits; suppress Ebola warning messages | |
5196 with the @code{Ignore-Ebola} wrapper macro. | |
5197 | |
5198 @deffn Macro Ignore-Ebola body | |
5199 Suppress Ebola warning messages while running tests in @var{body}. | |
5200 @var{body} is a @code{progn}-like body, and may contain several tests. | |
5201 @end deffn | |
5202 | |
5203 Both macros are defined temporarily within the test function. Simple | |
5204 examples: | |
5205 | |
5206 @example | |
5207 ;; Apparently Ignore-Ebola is a solution with no problem to address. | |
5208 ;; There are no examples in 21.5, anyway. | |
5209 | |
5210 ;; from regexp-tests.el | |
5211 (Known-Bug-Expect-Failure | |
5212 (Assert (not (string-match "\\b" ""))) | |
5213 (Assert (not (string-match " \\b" " ")))) | |
5214 @end example | |
5215 | |
5216 In general, you should avoid using functionality from packages in your | |
5217 tests, because you can't be sure that everyone will have the required | |
5218 package. However, if you've got a test that works, by all means add it. | |
Simply wrap the test in an appropriate availability check, add a notice that the test
5220 was skipped, and update the @code{skipped-test-reasons} hashtable. The | |
5221 wrapper macro @code{Skip-Test-Unless} is provided to handle common | |
5222 cases. | |
5223 | |
5224 @defvar skipped-test-reasons | |
5225 Hash table counting the number of times a particular reason is given for | |
5226 skipping tests. This is only defined within @code{test-emacs-test-file}. | |
5227 @end defvar | |
5228 | |
5229 @deffn Macro Skip-Test-Unless prerequisite reason description body | |
5230 @var{prerequisite} is usually a feature test (@code{featurep}, | |
5231 @code{boundp}, @code{fboundp}). @var{reason} is a string describing the | |
5232 prerequisite; it must be unique because it is used as a hash key in a | |
5233 table of reasons for skipping tests. @var{description} describes the | |
5234 tests being skipped, for the test result summary. @var{body} is a | |
5235 @code{progn}-like body, and may contain several tests. | |
5236 @end deffn | |
5237 | |
5238 @code{Skip-Test-Unless} is defined temporarily within the test function. | |
5239 Here's an example of usage from @file{syntax-tests.el}: | |
5240 | |
5241 @example | |
5242 ;; Test forward-comment at buffer boundaries | |
5243 (with-temp-buffer | |
5244 ;; try to use exactly what you need: featurep, boundp, fboundp | |
5245 (Skip-Test-Unless (fboundp 'c-mode) | |
5246 "c-mode unavailable" | |
5247 "comment and parse-partial-sexp tests" | |
5248 ;; and here's the test code | |
5249 (c-mode) | |
5250 (insert "// comment\n") | |
5251 (forward-comment -2) | |
5252 (Assert (eq (point) (point-min))) | |
5253 (let ((point (point))) | |
5254 (insert "/* comment */") | |
5255 (goto-char point) | |
5256 (forward-comment 2) | |
5257 (Assert (eq (point) (point-max))) | |
5258 (parse-partial-sexp point (point-max))))) | |
5259 @end example | |
5260 | |
5261 @code{Skip-Test-Unless} is intended for use with features that are normally | |
5262 present in typical configurations. For truly optional features, or | |
tests that apply to one of several alternative implementations (e.g., to
5264 GTK widgets, but not Athena, Motif, MS Windows, or Carbon), simply | |
5265 silently suppress the test if the feature is not available. | |
5266 | |
5267 Here are a few general hints for writing tests. | |
5268 | |
5269 @enumerate | |
5270 @item | |
5271 Include related successful cases. Fixes often break something. | |
5272 | |
5273 @item | |
5274 Use the Known-Bug-Expect-Failure macro to mark the cases you know | |
5275 are going to fail. We want to be able to distinguish between | |
5276 regressions and other unexpected failures, and cases that have | |
5277 been (partially) analyzed but not yet repaired. | |
5278 | |
5279 @item | |
5280 Mark the bug with the date of report. An ``Unfixed since yyyy-mm-dd'' | |
5281 gloss for Known-Bug-Expect-Failure is planned to further increase | |
5282 developer embarrassment (== incentive to fix the bug), but until then at | |
5283 least put a comment about the date so we can easily see when it was | |
5284 first reported. | |
5285 | |
5286 @item | |
5287 It's a matter of your judgement, but you should often use generic tests | |
5288 (@emph{e.g.}, @code{eq}) instead of more specific tests (@code{=} for | |
5289 numbers) even though you know that arguments ``should'' be of correct | |
5290 type. That is, if the functions used can return generic objects | |
5291 (typically @code{nil}), as well as some more specific type that will be | |
5292 returned on success. We don't want failures of those assertions | |
5293 reported as ``other failures'' (a wrong-type-arg signal, rather than a | |
5294 null return); we want them reported as ``assertion failures.'' | |
5295 | |
5296 One example is a test that tests @code{(= (string-match this that) 0)}, | |
5297 expecting a successful match. Now suppose @code{string-match} is broken | |
5298 such that the match fails. Then it will return @code{nil}, and @code{=} | |
5299 will signal ``wrong-type-argument, number-char-or-marker-p, nil'', | |
5300 generating an ``other failure'' in the report. But this should be | |
5301 reported as an assertion failure (the test failed in a foreseeable way), | |
5302 rather than something else (we don't know what happened because XEmacs | |
5303 is broken in a way that we weren't trying to test!) | |
5304 @end enumerate | |
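
The last hint can be sketched as follows.  This is only an illustration:
the strings are made up, and @code{Assert} is the macro from
@file{test-harness.el}, so the forms are meant to be evaluated inside a
test file run by the harness.

@example
;; Fragile: if string-match ever fails it returns nil, and `=' then
;; signals wrong-type-argument, which is reported as an "other failure".
(Assert (= (string-match "foo" "foobar") 0))

;; More robust: `eq' accepts any object, so a failed match simply makes
;; the assertion false and is reported as an assertion failure.
(Assert (eq (string-match "foo" "foobar") 0))
@end example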
5305 | |
5306 @node Modules for Regression Testing, , How to Regression-Test, Regression Testing XEmacs | |
5307 @section Modules for Regression Testing | |
5308 @cindex modules for regression testing | |
5309 @cindex regression testing, modules for | |
5310 | |
5311 @example | |
5312 @file{test-harness.el} | |
5313 @file{base64-tests.el} | |
5314 @file{byte-compiler-tests.el} | |
5315 @file{case-tests.el} | |
5316 @file{ccl-tests.el} | |
5317 @file{c-tests.el} | |
5318 @file{database-tests.el} | |
5319 @file{extent-tests.el} | |
5320 @file{hash-table-tests.el} | |
5321 @file{lisp-tests.el} | |
5322 @file{md5-tests.el} | |
5323 @file{mule-tests.el} | |
5324 @file{regexp-tests.el} | |
5325 @file{symbol-tests.el} | |
5326 @file{syntax-tests.el} | |
5327 @file{tag-tests.el} | |
5328 @file{weak-tests.el} | |
5329 @end example | |
5330 | |
5331 @file{test-harness.el} defines the macros @code{Assert}, | |
5332 @code{Check-Error}, @code{Check-Error-Message}, and | |
5333 @code{Check-Message}. The other files are test files, testing various | |
5334 XEmacs facilities. @xref{Regression Testing XEmacs}. | |
5335 | |
5336 | |
5337 @node CVS Techniques, XEmacs from the Inside, Regression Testing XEmacs, Top | |
5338 @chapter CVS Techniques | |
5339 @cindex CVS techniques | |
5340 | |
5341 @menu | |
5342 * Merging a Branch into the Trunk:: | |
5343 @end menu | |
5344 | |
5345 @node Merging a Branch into the Trunk, , CVS Techniques, CVS Techniques | |
5346 @section Merging a Branch into the Trunk | |
5347 @cindex merging a branch into the trunk | |
5348 | |
5349 @enumerate | |
5350 @item | |
5351 If you haven't already done a merge, you will be merging from the branch | |
5352 point; otherwise you'll be merging from the last merge point, which | |
5353 should be marked by a tag, e.g. @samp{last-sync-ben-mule-21-5}. In the | |
5354 former case, create the last-sync tag, e.g. | |
5355 | |
5356 @example | |
5357 crw rtag -r ben-mule-21-5-bp last-sync-ben-mule-21-5 xemacs | |
5358 @end example | |
5359 | |
5360 (You did create a branch point tag when you created the branch, didn't | |
5361 you?) | |
5362 | |
5363 @item | |
5364 Check everything in on your branch. | |
5365 | |
5366 @item | |
5367 Tag your branch with a pre-sync tag, e.g. | |
5368 | |
5369 @example | |
5370 crw rtag -r ben-mule-21-5 ben-mule-21-5-pre-feb-20-2002-sync xemacs | |
5371 @end example | |
5372 | |
5373 Note, you need to use rtag and specify a version with @samp{-r} (use | |
5374 @samp{-r HEAD} if necessary) so that removed files are handled correctly | |
5375 in some obscure cases. See section 4.8 of the CVS manual. | |
5376 | |
5377 @item | |
5378 Tag the trunk so you have a stable place to merge up to in case people | |
5379 are asynchronously committing to the trunk, e.g. | |
5380 | |
5381 @example | |
5382 crw rtag -r HEAD main-branch-ben-mule-21-5-syncpoint-feb-20-2002 xemacs | |
5383 crw rtag -F -r main-branch-ben-mule-21-5-syncpoint-feb-20-2002 next-sync-ben-mule-21-5 xemacs | |
5384 @end example | |
5385 | |
5386 Use @samp{-F} in the second case because the name might already exist, e.g. if | |
5387 you've already done a merge. We make two tags because one is a | |
5388 permanent mark indicating a syncpoint when merging, and the other is a | |
5389 symbolic tag to make other operations easier. | |
5390 | |
5391 @item | |
5392 Make a backup of your source tree (not totally necessary but useful for | |
5393 reference and peace of mind): Move one level up from the top directory | |
5394 of your branch and do, e.g. | |
5395 | |
5396 @example | |
5397 cp -a mule mule-backup-2-23-02 | |
5398 @end example | |
5399 | |
5400 @item | |
5401 Now, we're ready to merge! Make sure you're in the top directory of | |
5402 your branch and do, e.g. | |
5403 | |
5404 @example | |
5405 cvs update -j last-sync-ben-mule-21-5 -j next-sync-ben-mule-21-5 | |
5406 @end example | |
5407 | |
5408 @item | |
5409 Fix all merge conflicts. Get the sucker to compile and run. | |
5410 | |
5411 @item | |
5412 Tag your branch with a post-sync tag, e.g. | |
5413 | |
5414 @example | |
5415 crw rtag -r ben-mule-21-5 ben-mule-21-5-post-feb-20-2002-sync xemacs | |
5416 @end example | |
5417 | |
5418 @item | |
5419 Update the last-sync tag, e.g. | |
5420 | |
5421 @example | |
5422 crw rtag -F -r next-sync-ben-mule-21-5 last-sync-ben-mule-21-5 xemacs | |
5423 @end example | |
5424 @end enumerate | |
5425 | |
5426 | |
5427 @node XEmacs from the Inside, The XEmacs Object System (Abstractly Speaking), CVS Techniques, Top | |
5428 @chapter XEmacs from the Inside | |
5429 @cindex XEmacs from the inside | |
5430 @cindex inside, XEmacs from the | |
5431 | |
5432 Internally, XEmacs is quite complex, and can be very confusing. To | |
5433 simplify things, it can be useful to think of XEmacs as containing an | |
5434 event loop that ``drives'' everything, and a number of other subsystems, | |
5435 such as a Lisp engine and a redisplay mechanism. Each of these other | |
5436 subsystems exists simultaneously in XEmacs, and each has a certain | |
5437 state. The flow of control continually passes in and out of these | |
5438 different subsystems in the course of normal operation of the editor. | |
5439 | |
5440 It is important to keep in mind that, most of the time, the editor is | |
5441 ``driven'' by the event loop. Except during initialization and batch | |
5442 mode, all subsystems are entered directly or indirectly through the | |
5443 event loop, and ultimately, control exits out of all subsystems back up | |
5444 to the event loop. This cycle of entering a subsystem, exiting back out | |
5445 to the event loop, and starting another iteration of the event loop | |
5446 occurs once each keystroke, mouse motion, etc. | |
5447 | |
5448 If you're trying to understand a particular subsystem (other than the | |
5449 event loop), think of it as a ``daemon'' process or ``servant'' that is | |
5450 responsible for one particular aspect of a larger system, and | |
5451 periodically receives commands or environment changes that cause it to | |
5452 do something. Ultimately, these commands and environment changes are | |
5453 always triggered by the event loop. For example: | |
5454 | |
5455 @itemize @bullet | |
5456 @item | |
5457 The window and frame mechanism is responsible for keeping track of what | |
5458 windows and frames exist, what buffers are in them, etc. It is | |
5459 periodically given commands (usually from the user) to make a change to | |
5460 the current window/frame state: e.g. create a new frame, delete a | |
5461 window, etc. | |
5462 | |
5463 @item | |
5464 The buffer mechanism is responsible for keeping track of what buffers | |
5465 exist and what text is in them. It is periodically given commands | |
5466 (usually from the user) to insert or delete text, create a buffer, etc. | |
5467 When it receives a text-change command, it notifies the redisplay | |
5468 mechanism. | |
5469 | |
5470 @item | |
5471 The redisplay mechanism is responsible for making sure that windows and | |
5472 frames are displayed correctly. It is periodically told (by the event | |
5473 loop) to actually ``do its job'', i.e. snoop around and see what the | |
5474 current state of the environment (mostly of the currently-existing | |
5475 windows, frames, and buffers) is, and make sure that state matches | |
5476 what's actually displayed. It keeps lots and lots of information around | |
5477 (such as what is actually being displayed currently, and what the | |
5478 environment was last time it checked) so that it can minimize the work | |
5479 it has to do. It is also helped along in that whenever a relevant | |
5480 change to the environment occurs, the redisplay mechanism is told about | |
5481 this, so it has a pretty good idea of where it has to look to find | |
5482 possible changes and doesn't have to look everywhere. | |
5483 | |
5484 @item | |
5485 The Lisp engine is responsible for executing the Lisp code in which most | |
5486 user commands are written. It is entered through a call to @code{eval} | |
5487 or @code{funcall}, which occurs as a result of dispatching an event from | |
5488 the event loop. The functions it calls issue commands to the buffer | |
5489 mechanism, the window/frame subsystem, etc. | |
5490 | |
5491 @item | |
5492 The Lisp allocation subsystem is responsible for keeping track of Lisp | |
5493 objects. It is given commands from the Lisp engine to allocate objects, | |
5494 garbage collect, etc. | |
5495 @end itemize | |
5496 | |
5497 etc. | |
5498 | |
5499 The important idea here is that there are a number of independent | |
5500 subsystems each with its own responsibility and persistent state, just | |
5501 like different employees in a company, and each subsystem is | |
5502 periodically given commands from other subsystems. Commands can flow | |
5503 from any one subsystem to any other, but there is usually some sort of | |
5504 hierarchy, with all commands originating from the event subsystem. | |
5505 | |
5506 XEmacs is entered in @code{main()}, which is in @file{emacs.c}. When | |
5507 this is called the first time (in a properly-invoked @file{temacs}), it | |
5508 does the following: | |
5509 | |
5510 @enumerate | |
5511 @item | |
5512 It does some very basic environment initializations, such as determining | |
5513 where it and its directories (e.g. @file{lisp/} and @file{etc/}) reside | |
5514 and setting up signal handlers. | |
5515 @item | |
5516 It initializes the entire Lisp interpreter. | |
5517 @item | |
5518 It sets the initial values of many built-in variables (including many | |
5519 variables that are visible to Lisp programs), such as the global keymap | |
5520 object and the built-in faces (a face is an object that describes the | |
5521 display characteristics of text). This involves creating Lisp objects | |
5522 and thus is dependent on step (2). | |
5523 @item | |
5524 It performs various other initializations that are relevant to the | |
5525 particular environment it is running in, such as retrieving environment | |
5526 variables, determining the current date and the user who is running the | |
5527 program, examining its standard input, creating any necessary file | |
5528 descriptors, etc. | |
5529 @item | |
5530 At this point, the C initialization is complete. A Lisp program that | |
5531 was specified on the command line (usually @file{loadup.el}) is called | |
5532 (temacs is normally invoked as @code{temacs -batch -l loadup.el dump}). | |
5533 @file{loadup.el} loads all of the other Lisp files that are needed for | |
5534 the operation of the editor, calls the @code{dump-emacs} function to | |
5535 write out @file{xemacs}, and then kills the temacs process. | |
5536 @end enumerate | |
5537 | |
5538 When @file{xemacs} is then run, it only redoes steps (1) and (4) | |
5539 above; all variables already contain the values they were set to when | |
5540 the executable was dumped, and all memory that was allocated with | |
5541 @code{malloc()} is still around. (XEmacs knows whether it is being run | |
5542 as @file{xemacs} or @file{temacs} because it sets the global variable | |
5543 @code{initialized} to 1 after step (4) above.) At this point, | |
5544 @file{xemacs} calls a Lisp function to do any further initialization, | |
5545 which includes parsing the command-line (the C code can only do limited | |
5546 command-line parsing, which includes looking for the @samp{-batch} and | |
5547 @samp{-l} flags and a few other flags that it needs to know about before | |
5548 initialization is complete), creating the first frame (or @dfn{window} | |
5549 in standard window-system parlance), running the user's init file | |
5550 (usually the file @file{.emacs} in the user's home directory), etc. The | |
5551 function to do this is usually called @code{normal-top-level}; | |
5552 @file{loadup.el} tells the C code about this function by setting its | |
5553 name as the value of the Lisp variable @code{top-level}. | |
5554 | |
5555 When the Lisp initialization code is done, the C code enters the event | |
5556 loop, and stays there for the duration of the XEmacs process. The event | |
5557 loop is implemented by the function @code{Fcommand_loop_1()} in | |
5558 @file{cmdloop.c}. Note that this event loop could very well be | |
5559 written in Lisp, and in fact a Lisp version exists; but apparently, | |
5560 doing this makes XEmacs run noticeably slower. | |
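
To make the shape of such a loop concrete, here is a deliberately
simplified sketch in Lisp.  It is not the Lisp version mentioned above,
only an assumption about what a minimal loop built on @code{make-event},
@code{next-event} and @code{dispatch-event} might look like; it does none
of the error handling or bookkeeping that the real command loop does.

@example
;; Simplified sketch only; evaluating this loops until interrupted.
(let ((event (make-event)))
  (while t
    (next-event event)        ; wait for the next event and fill it in
    (dispatch-event event)))  ; hand it to the appropriate subsystem
@end example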
5561 | |
5562 Notice how much of the initialization is done in Lisp, not in C. | |
5563 In general, XEmacs tries to move as much code as is possible | |
5564 into Lisp. Code that remains in C is code that implements the | |
5565 Lisp interpreter itself, or code that needs to be very fast, or | |
5566 code that needs to do system calls or other such stuff that | |
5567 needs to be done in C, or code that needs to have access to | |
5568 ``forbidden'' structures. (One conscious aspect of the design of | |
5569 Lisp under XEmacs is a clean separation between the external | |
5570 interface to a Lisp object's functionality and its internal | |
5571 implementation. Part of this design is that Lisp programs | |
5572 are forbidden from accessing the contents of the object other | |
5573 than through using a standard API. In this respect, XEmacs Lisp | |
5574 is similar to modern Lisp dialects but differs from GNU Emacs, | |
5575 which tends to expose the implementation and allow Lisp | |
5576 programs to look at it directly. The major advantage of | |
5577 hiding the implementation is that it allows the implementation | |
5578 to be redesigned without affecting any Lisp programs, including | |
5579 those that might want to be ``clever'' by looking directly at | |
5580 the object's contents and possibly manipulating them.) | |
5581 | |
5582 Moving code into Lisp makes the code easier to debug and maintain and | |
5583 makes it much easier for people who are not XEmacs developers to | |
5584 customize XEmacs, because they can make a change with much less chance | |
5585 of obscure and unwanted interactions occurring than if they were to | |
5586 change the C code. | |
5587 | |
5588 @node The XEmacs Object System (Abstractly Speaking), How Lisp Objects Are Represented in C, XEmacs from the Inside, Top | |
5589 @chapter The XEmacs Object System (Abstractly Speaking) | |
5590 @cindex XEmacs object system (abstractly speaking), the | |
5591 @cindex object system (abstractly speaking), the XEmacs | |
5592 | |
5593 At the heart of the Lisp interpreter is its management of objects. | |
5594 XEmacs Lisp contains many built-in objects, some of which are | |
5595 simple and others of which can be very complex; and some of which | |
5596 are very common, and others of which are rarely used or are only | |
5597 used internally. (Since the Lisp allocation system, with its | |
5598 automatic reclamation of unused storage, is so much more convenient | |
5599 than @code{malloc()} and @code{free()}, the C code makes extensive use of it | |
5600 in its internal operations.) | |
5601 | |
5602 The basic Lisp objects are | |
5603 | |
5604 @table @code | |
5605 @item integer | |
5606 31 bits of precision, or 63 bits on 64-bit machines; the | |
5607 reason for this is described below when the internal Lisp object | |
5608 representation is described. | |
5609 @item char | |
5610 An object representing a single character of text; chars behave like | |
5611 integers in many ways but are logically considered text rather than | |
5612 numbers and have a different read syntax. (The read syntax for a char | |
5613 contains the char itself or some textual encoding of it---for example, | |
5614 a Japanese Kanji character might be encoded as @samp{^[$(B#&^[(B} using the | |
5615 ISO-2022 encoding standard---rather than the numerical representation | |
5616 of the char; this way, if the mapping between chars and integers | |
5617 changes, which is quite possible for Kanji characters and other extended | |
5618 characters, the same character will still be created. Note that some | |
5619 primitives confuse chars and integers. The worst culprit is @code{eq}, | |
5620 which makes a special exception and considers a char to be @code{eq} to | |
5621 its integer equivalent, even though in no other case are objects of two | |
5622 different types @code{eq}. The reason for this monstrosity is | |
5623 compatibility with existing code; the separation of char from integer | |
5624 came fairly recently.) | |
5625 @item float | |
5626 Same precision as a double in C. | |
5627 @item bignum | |
5628 @itemx ratio | |
5629 @itemx bigfloat | |
5630 As build-time options, arbitrary-precision numbers are available. | |
5631 Bignums are integers, and when available they remove the restriction on | |
5632 buffer size. Ratios are non-integral rational numbers. Bigfloats are | |
5633 arbitrary-precision floating point numbers, with precision specified at | |
5634 runtime. | |
5635 @item symbol | |
5636 An object that contains Lisp objects and is referred to by name; | |
5637 symbols are used to implement variables and named functions | |
5638 and to provide the equivalent of preprocessor constants in C. | |
5639 @item string | |
5640 Self-explanatory; behaves much like a vector of chars | |
5641 but has a different read syntax and is stored and manipulated | |
5642 more compactly. | |
5643 @item bit-vector | |
5644 A vector of bits; similar to a string in spirit. | |
5645 @item vector | |
5646 A one-dimensional array of Lisp objects providing constant-time access | |
5647 to any of the objects; access to an arbitrary object in a vector is | |
5648 faster than for lists, but the operations that can be done on a vector | |
5649 are more limited. | |
5650 @item compiled-function | |
5651 An object containing compiled Lisp code, known as @dfn{byte code}. | |
5652 @item subr | |
5653 A Lisp primitive, i.e. a Lisp-callable function implemented in C. | |
5654 @item cons | |
5655 A simple container for two Lisp objects, used to implement lists and | |
5656 most other data structures in Lisp. | |
5657 @end table | |
5658 | |
5659 Objects which are not conses are called atoms. | |
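
The distinction between conses and atoms can be checked from Lisp with
the standard predicates @code{consp} and @code{atom}.  The following
forms are a small illustration; the values being tested are arbitrary.

@example
(consp '(foo . bar))   ; t: a cons
(atom '(foo . bar))    ; nil
(atom 'foo)            ; t: a symbol is an atom
(atom "foobar")        ; t: so is a string
(atom [1 a 2.5])       ; t: vectors are atoms too
(atom nil)             ; t: the empty list is not a cons
@end example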
5660 | |
5661 @cindex closure | |
5662 Note that there is no basic ``function'' type, as in more powerful | |
5663 versions of Lisp (where it's called a @dfn{closure}). XEmacs Lisp does | |
5664 not provide the closure semantics implemented by Common Lisp and Scheme. | |
5665 The guts of a function in XEmacs Lisp are represented in one of four | |
5666 ways: a symbol specifying another function (when one function is an | |
5667 alias for another), a list (whose first element must be the symbol | |
5668 @code{lambda}) containing the function's source code, a | |
5669 compiled-function object, or a subr object. (In other words, given a | |
5670 symbol specifying the name of a function, calling @code{symbol-function} | |
5671 to retrieve the contents of the symbol's function cell will return one | |
5672 of these types of objects.) | |
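
As a rough illustration of these representations, the forms below
inspect a few function cells.  The names @code{example-double} and
@code{example-twice} are hypothetical and exist only for this sketch;
the results noted in the comments assume the definitions are evaluated
directly rather than byte-compiled.

@example
(defun example-double (x) (* 2 x))         ; hypothetical function
(defalias 'example-twice 'example-double)  ; hypothetical alias

(symbol-function 'example-twice)           ; the symbol example-double
(consp (symbol-function 'example-double))  ; t: a lambda list
(subrp (symbol-function 'car))             ; t: car is written in C
;; nil here; t once example-double has been byte-compiled
(compiled-function-p (symbol-function 'example-double))
@end example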
5673 | |
5674 XEmacs Lisp also contains numerous specialized objects used to implement | |
5675 the editor: | |
5676 | |
5677 @table @code | |
5678 @item buffer | |
5679 Stores text like a string, but is optimized for insertion and deletion | |
5680 and has certain other properties that can be set. | |
5681 @item frame | |
5682 An object with various properties whose displayable representation is a | |
5683 @dfn{window} in window-system parlance. | |
5684 @item window | |
5685 A section of a frame that displays the contents of a buffer; | |
5686 often called a @dfn{pane} in window-system parlance. | |
5687 @item window-configuration | |
5688 An object that represents a saved configuration of windows in a frame. | |
5689 @item device | |
5690 An object representing a screen on which frames can be displayed; | |
5691 equivalent to a @dfn{display} in the X Window System and a @dfn{TTY} in | |
5692 character mode. | |
5693 @item face | |
5694 An object specifying the appearance of text or graphics; it has | |
5695 properties such as font, foreground color, and background color. | |
5696 @item marker | |
5697 An object that refers to a particular position in a buffer and moves | |
5698 around as text is inserted and deleted to stay in the same relative | |
5699 position to the text around it. | |
5700 @item extent | |
5701 Similar to a marker but covers a range of text in a buffer; can also | |
5702 specify properties of the text, such as a face in which the text is to | |
5703 be displayed, whether the text is invisible or unmodifiable, etc. | |
5704 @item event | |
5705 Generated by calling @code{next-event} and contains information | |
5706 describing a particular event happening in the system, such as the user | |
5707 pressing a key or a process terminating. | |
5708 @item keymap | |
5709 An object that maps from events (described using lists, vectors, and | |
5710 symbols rather than with an event object because the mapping is for | |
5711 classes of events, rather than individual events) to functions to | |
5712 execute or other events to recursively look up; the functions are | |
5713 described by name, using a symbol, or using lists to specify the | |
5714 function's code. | |
5715 @item glyph | |
5716 An object that describes the appearance of an image (e.g. pixmap) on | |
5717 the screen; glyphs can be attached to the beginning or end of extents | |
5718 and in some future version of XEmacs will be able to be inserted | |
5719 directly into a buffer. | |
5720 @item process | |
5721 An object that describes a connection to an externally-running process. | |
5722 @end table | |
5723 | |
5724 There are some other, less-commonly-encountered general objects: | |
5725 | |
5726 @table @code | |
5727 @item hash-table | |
5728 An object that maps from an arbitrary Lisp object to another arbitrary | |
5729 Lisp object, using hashing for fast lookup. | |
5730 @item obarray | |
5731 A limited form of hash-table that maps from strings to symbols; obarrays | |
5732 are used to look up a symbol given its name and are not actually their | |
5733 own object type but are kludgily represented using vectors with hidden | |
5734 fields (this representation derives from GNU Emacs). | |
5735 @item specifier | |
5736 A complex object used to specify the value of a display property; a | |
5737 default value is given and different values can be specified for | |
5738 particular frames, buffers, windows, devices, or classes of device. | |
5739 @item char-table | |
5740 An object that maps from chars or classes of chars to arbitrary Lisp | |
5741 objects; internally char tables use a complex nested-vector | |
5742 representation that is optimized to the way characters are represented | |
5743 as integers. | |
5744 @item range-table | |
5745 An object that maps from ranges of integers to arbitrary Lisp objects. | |
5746 @end table | |
5747 | |
5748 And some strange special-purpose objects: | |
5749 | |
5750 @table @code | |
5751 @item charset | |
5752 @itemx coding-system | |
5753 Objects used when MULE, or multi-lingual/Asian-language, support is | |
5754 enabled. | |
5755 @item color-instance | |
5756 @itemx font-instance | |
5757 @itemx image-instance | |
5758 An object that encapsulates a window-system resource; instances are | |
5759 mostly used internally but are exposed on the Lisp level for cleanness | |
5760 of the specifier model and because it's occasionally useful for Lisp | |
5761 programs to create or query the properties of instances. | |
5762 @item subwindow | |
5763 An object that encapsulates a @dfn{subwindow} resource, i.e. a | |
5764 window-system child window that is drawn into by an external process; | |
5765 this object should be integrated into the glyph system but isn't yet, | |
5766 and may change form when this is done. | |
5767 @item tooltalk-message | |
5768 @itemx tooltalk-pattern | |
5769 Objects that represent resources used in the ToolTalk interprocess | |
5770 communication protocol. | |
5771 @item toolbar-button | |
5772 An object used in conjunction with the toolbar. | |
5773 @end table | |
5774 | |
5775 And objects that are only used internally: | |
5776 | |
5777 @table @code | |
5778 @item opaque | |
5779 A generic object for encapsulating arbitrary memory; this allows you the | |
5780 generality of @code{malloc()} and the convenience of the Lisp object | |
5781 system. | |
5782 @item lstream | |
5783 A buffering I/O stream, used to provide a unified interface to anything | |
5784 that can accept output or provide input, such as a file descriptor, a | |
5785 stdio stream, a chunk of memory, a Lisp buffer, a Lisp string, etc.; | |
5786 it's a Lisp object to make its memory management more convenient. | |
5787 @item char-table-entry | |
5788 Subsidiary objects in the internal char-table representation. | |
5789 @item extent-auxiliary | |
5790 @itemx menubar-data | |
5791 @itemx toolbar-data | |
5792 Various special-purpose objects that are basically just used to | |
5793 encapsulate memory for particular subsystems, similar to the more | |
5794 general ``opaque'' object. | |
5795 @item symbol-value-forward | |
5796 @itemx symbol-value-buffer-local | |
5797 @itemx symbol-value-varalias | |
5798 @itemx symbol-value-lisp-magic | |
5799 Special internal-only objects that are placed in the value cell of a | |
5800 symbol to indicate that there is something special with this variable -- | |
5801 e.g. it has no value, it mirrors another variable, or it mirrors some C | |
5802 variable; there is really only one kind of object, called a | |
5803 @dfn{symbol-value-magic}, but it is sort-of halfway kludged into | |
5804 semi-different object types. | |
5805 @end table | |
5806 | |
5807 @cindex permanent objects | |
5808 @cindex temporary objects | |
5809 Some types of objects are @dfn{permanent}, meaning that once created, | |
5810 they do not disappear until explicitly destroyed, using a function such | |
5811 as @code{delete-buffer}, @code{delete-window}, @code{delete-frame}, etc. | |
5812 Others will disappear once they are no longer used, through the garbage | |
5813 collection mechanism. Buffers, frames, windows, devices, and processes | |
5814 are among the objects that are permanent. Note that some objects can go | |
5815 both ways: Faces can be created either way; extents are normally | |
5816 permanent, but detached extents (extents not referring to any text, as | |
5817 happens to some extents when the text they are referring to is deleted) | |
5818 are temporary. Note that some permanent objects, such as faces and | |
5819 coding systems, cannot be deleted. Note also that windows are unique in | |
5820 that they can be @emph{undeleted} after having previously been | |
5821 deleted. (This happens as a result of restoring a window configuration.) | |
5822 | |
5823 @cindex read syntax | |
5824 Many types of objects have a @dfn{read syntax}, i.e. a way of | |
5825 specifying an object of that type in Lisp code. When you load a Lisp | |
5826 file, or type in code to be evaluated, what really happens is that the | |
5827 function @code{read} is called, which reads some text and creates an object | |
5828 based on the syntax of that text; then @code{eval} is called, which | |
5829 possibly does something special; then this loop repeats until there's | |
5830 no more text to read. (@code{eval} only actually does something special | |
5831 with symbols, which causes the symbol's value to be returned, | |
5832 similar to referencing a variable; and with conses [i.e. lists], | |
5833 which cause a function invocation. All other values are returned | |
5834 unchanged.) | |
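
One step of this read-then-evaluate cycle can be imitated by hand.  The
sketch below uses @code{read-from-string}, which returns a cons of the
object read and the index where reading stopped, in place of reading
from a file; the input text is arbitrary.

@example
;; Read one form from text, then evaluate it.
(let ((form (car (read-from-string "(+ 1 2)"))))
  (list form (eval form)))   ; ((+ 1 2) 3)

;; Objects other than symbols and conses evaluate to themselves.
(eval 17297)                 ; 17297
(eval "foobar")              ; "foobar"
@end example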
5835 | |
5836 The read syntax | |
5837 | |
5838 @example | |
5839 17297 | |
5840 @end example | |
5841 | |
5842 converts to an integer whose value is 17297. | |
5843 | |
5844 @example | |
5845 355/113 | |
5846 @end example | |
5847 | |
5848 converts, when ratios are configured, to a ratio commonly used to | |
5849 approximate @emph{pi}, and otherwise to a symbol whose name is ``355/113'' (for | |
5850 backward compatibility). | |
5851 | |
5852 @example | |
5853 1.983e-4 | |
5854 @end example | |
5855 | |
5856 converts to a float whose value is 1.983e-4, or .0001983. | |
5857 | |
5858 @example | |
5859 ?b | |
5860 @end example | |
5861 | |
5862 converts to a char that represents the lowercase letter b. | |
5863 | |
5864 @example | |
5865 ?^[$(B#&^[(B | |
5866 @end example | |
5867 | |
5868 (where @samp{^[} actually is an @samp{ESC} character) converts to a | |
5869 particular Kanji character when using an ISO-2022-based coding system for | |
5870 input. (To decode this goo: @samp{ESC} begins an escape sequence; | |
5871 @samp{ESC $ (} is a class of escape sequences meaning ``switch to a | |
5872 94x94 character set''; @samp{ESC $ ( B} means ``switch to Japanese | |
5873 Kanji''; @samp{#} and @samp{&} collectively index into a 94-by-94 array | |
5874 of characters [subtract 33 from the ASCII value of each character to get | |
5875 the corresponding index]; @samp{ESC (} is a class of escape sequences | |
5876 meaning ``switch to a 94 character set''; @samp{ESC ( B} means ``switch | |
5877 to US ASCII''. It is a coincidence that the letter @samp{B} is used to | |
5878 denote both Japanese Kanji and US ASCII. If the first @samp{B} were | |
5879 replaced with an @samp{A}, you'd be requesting a Chinese Hanzi character | |
5880 from the GB2312 character set.) | |
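
The index arithmetic described above can be worked through for the two
bytes in this example.  The sketch assumes only that @samp{#} and
@samp{&} have their usual ASCII values, 35 and 38.

@example
(- (char-to-int ?#) 33)   ; 35 - 33 = 2, the first index
(- (char-to-int ?&) 33)   ; 38 - 33 = 5, the second index
@end example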
5881 | |
5882 @example | |
5883 "foobar" | |
5884 @end example | |
5885 | |
5886 converts to a string. | |
5887 | |
5888 @example | |
5889 foobar | |
5890 @end example | |
5891 | |
5892 converts to a symbol whose name is @code{"foobar"}. This is done by | |
5893 looking up the string equivalent in the global variable | |
5894 @code{obarray}, whose contents should be an obarray. If no symbol | |
5895 is found, a new symbol with the name @code{"foobar"} is automatically | |
5896 created and added to @code{obarray}; this process is called | |
5897 @dfn{interning} the symbol. | |
5898 @cindex interning | |
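
The same lookup can be done explicitly from Lisp with @code{intern} and
@code{intern-soft}.  In this sketch the name
@code{example-unused-name} is hypothetical and is assumed not to have
been interned before.

@example
;; Look up without creating: nil if the name is not yet interned.
(intern-soft "example-unused-name")
;; Look up, creating the symbol and adding it to obarray if needed.
(intern "example-unused-name")
;; The reader and intern arrive at the very same symbol.
(eq (intern "foobar") 'foobar)       ; t
(symbol-name 'foobar)                ; "foobar"
@end example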
5899 | |
5900 @example | |
5901 (foo . bar) | |
5902 @end example | |
5903 | |
5904 converts to a cons cell containing the symbols @code{foo} and @code{bar}. | |
5905 | |
5906 @example | |
5907 (1 a 2.5) | |
5908 @end example | |
5909 | |
5910 converts to a three-element list containing the specified objects | |
5911 (note that a list is actually a set of nested conses; see the | |
5912 XEmacs Lisp Reference). | |
5913 | |
5914 @example | |
5915 [1 a 2.5] | |
5916 @end example | |
5917 | |
5918 converts to a three-element vector containing the specified objects. | |
5919 | |
5920 @example | |
5921 #[... ... ... ...] | |
5922 @end example | |
5923 | |
5924 converts to a compiled-function object (the actual contents are not | |
5925 shown since they are not relevant here; look at a file that ends with | |
5926 @file{.elc} for examples). | |
5927 | |
5928 @example | |
5929 #*01110110 | |
5930 @end example | |
5931 | |
5932 converts to a bit-vector. | |
5933 | |
5934 @example | |
5935 #s(hash-table ... ...) | |
5936 @end example | |
5937 | |
5938 converts to a hash table (the actual contents are not shown). | |
5939 | |
5940 @example | |
5941 #s(range-table ... ...) | |
5942 @end example | |
5943 | |
5944 converts to a range table (the actual contents are not shown). | |
5945 | |
5946 @example | |
5947 #s(char-table ... ...) | |
5948 @end example | |
5949 | |
5950 converts to a char table (the actual contents are not shown). | |
5951 | |
5952 Note that the @code{#s()} syntax is the general syntax for structures, | |
5953 which are not really implemented in XEmacs Lisp but should be. | |
5954 | |
5955 When an object is printed out (using @code{print} or a related | |
5956 function), the read syntax is used, so that the same object can be read | |
5957 in again. | |
5958 | |
5959 The other objects do not have read syntaxes, usually because it does not | |
5960 really make sense to create them in this fashion (e.g. processes, where | |
5961 it doesn't make sense to have a subprocess created as a side effect of | |
5962 reading some Lisp code), or because they can't be created at all | |
5963 (e.g. subrs). Permanent objects, as a rule, do not have a read syntax; | |
5964 nor do most complex objects, which contain too much state to be easily | |
5965 initialized through a read syntax. | |
5966 | |
5967 @node How Lisp Objects Are Represented in C, Allocation of Objects in XEmacs Lisp, The XEmacs Object System (Abstractly Speaking), Top | |
5968 @chapter How Lisp Objects Are Represented in C | |
5969 @cindex Lisp objects are represented in C, how | |
5970 @cindex objects are represented in C, how Lisp | |
5971 @cindex represented in C, how Lisp objects are | |
5972 | |
5973 Lisp objects are represented in C using a 32-bit or 64-bit machine word | |
5974 (depending on the processor; e.g. DEC Alphas use 64-bit Lisp objects and | |
5975 most other processors use 32-bit Lisp objects). The representation | |
5976 stuffs a pointer together with a tag, as follows: | |
5977 | |
5978 @example | |
5979 [ 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ] | |
5980 [ 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ] | |
5981 | |
5982 <---------------------------------------------------------> <-> | |
5983 a pointer to a structure, or an integer tag | |
5984 @end example | |
5985 | |
5986 A tag of 00 is used for all pointer object types, a tag of 10 is used | |
5987 for characters, and the other two tags 01 and 11 are joined together to | |
5988 form the integer object type. This representation gives us 31-bit | |
5989 integers and 30-bit characters, while pointers are represented directly | |
5990 without any bit masking or shifting. This representation, though, | |
5991 assumes that pointers to structs are always aligned to multiples of 4, | |
5992 so the lower 2 bits are always zero. | |
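The tag layout just described can be sketched as follows. This is a toy model with made-up names (@code{toy_object}, @code{toy_kind_of}, and so on), assuming a 32-bit word; it is not the actual XEmacs definition.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a 32-bit Lisp word; names here are illustrative only. */
typedef uint32_t toy_object;

enum toy_kind { TOY_POINTER, TOY_CHAR, TOY_INT };

/* Classify a word by its low two bits: 00 pointer, 10 character,
   01 or 11 integer (so the lowest bit alone marks an integer). */
static enum toy_kind
toy_kind_of (toy_object obj)
{
  if (obj & 1)
    return TOY_INT;
  return (obj & 2) ? TOY_CHAR : TOY_POINTER;
}

/* A character keeps its code in the upper 30 bits. */
static toy_object
toy_make_char (uint32_t code)
{
  assert (code < (1u << 30));
  return (code << 2) | 2;
}

/* Recover the character code by dropping the two tag bits. */
static uint32_t
toy_char_code (toy_object obj)
{
  assert (toy_kind_of (obj) == TOY_CHAR);
  return obj >> 2;
}
```

Because a pointer aligned to a multiple of 4 already ends in 00, it needs no encoding step at all in this scheme.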
5993 | |
5994 Lisp objects use the typedef @code{Lisp_Object}, but the actual C type | |
5995 used for the Lisp object can vary. It can be either a simple type | |
5996 (@code{long} on the DEC Alpha, @code{int} on other machines) or a | |
5997 structure whose fields are bit fields that line up properly (actually, a | |
5998 union of structures is used). Generally the simple integral type is | |
5999 preferable because it ensures that the compiler will actually use a | |
6000 machine word to represent the object (some compilers will use more | |
6001 general and less efficient code for unions and structs even if they can | |
6002 fit in a machine word). The union type, however, has the advantage of | |
6003 stricter type checking. If you accidentally pass an integer where a Lisp | |
6004 object is desired, you get a compile error. The choice of which type | |
6005 to use is determined by the preprocessor constant @code{USE_UNION_TYPE} | |
6006 which is defined via the @code{--use-union-type} option to | |
6007 @code{configure}. | |
6008 | |
6009 Various macros are used to convert between Lisp_Objects and the | |
6010 corresponding C type. Macros of the form @code{XINT()}, @code{XCHAR()}, | |
6011 @code{XSTRING()}, @code{XSYMBOL()} perform any required bit shifting and/or | |
6012 masking and cast the result to the appropriate type. @code{XINT()} needs to be | |
6013 a bit tricky so that negative numbers are properly sign-extended. Since | |
6014 integers are stored left-shifted, if the right-shift operator does an | |
6015 arithmetic shift (i.e. it leaves the most-significant bit as-is rather | |
6016 than shifting in a zero, so that it mimics a divide-by-two even for | |
6017 negative numbers) the shift to remove the tag bit is enough. This is | |
6018 the case on all the systems we support. | |
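A minimal sketch of the sign-extension point, assuming a 32-bit word with a single low tag bit for integers. The names @code{toy_make_int} and @code{toy_xint} are hypothetical stand-ins, not the real macros, and the code relies on the same arithmetic right shift the text mentions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for constructing a tagged integer. */
static int32_t
toy_make_int (int32_t n)
{
  /* Only 31 bits of payload are available. */
  assert (n >= -(1 << 30) && n < (1 << 30));
  /* Shift in unsigned arithmetic to avoid undefined behavior for
     negative N, then set the low tag bit.  The conversion back to
     int32_t assumes a two's-complement machine. */
  return (int32_t) (((uint32_t) n << 1) | 1u);
}

/* Illustrative stand-in for extracting the integer again. */
static int32_t
toy_xint (int32_t obj)
{
  /* Relies on >> being an arithmetic shift for negative values,
     which C leaves implementation-defined. */
  return obj >> 1;
}
```

On a compiler with a logical (zero-filling) right shift, @code{toy_xint} would need to restore the sign bit explicitly.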
6019 | |
6020 Note that when @code{ERROR_CHECK_TYPECHECK} is defined, the converter | |
6021 macros become more complicated---they check the tag bits and/or the | |
6022 type field in the first four bytes of a record type to ensure that the | |
6023 object is really of the correct type. This is great for catching places | |
6024 where an incorrect type is being dereferenced---this typically results | |
6025 in a pointer being dereferenced as the wrong type of structure, with | |
6026 unpredictable (and sometimes not easily traceable) results. | |
6027 | |
6028 There are similar @code{XSET@var{TYPE}()} macros that construct a Lisp | |
6029 object. These macros are of the form @code{XSET@var{TYPE} | |
6030 (@var{lvalue}, @var{result})}, i.e. they have to be a statement rather | |
6031 than just used in an expression. The reason for this is that standard C | |
6032 doesn't let you ``construct'' a structure (but GCC does). Granted, this | |
6033 sometimes isn't too convenient; for the case of integers, at least, you | |
6034 can use the function @code{make_int()}, which constructs and | |
6035 @emph{returns} an integer Lisp object. Note that the | |
6036 @code{XSET@var{TYPE}()} macros are also affected by | |
6037 @code{ERROR_CHECK_TYPECHECK} and make sure that the structure is of the | |
6038 right type in the case of record types, where the type is contained in | |
6039 the structure. | |
6040 | |
6041 The C programmer is responsible for @strong{guaranteeing} that a | |
6042 Lisp_Object is the correct type before using the @code{X@var{TYPE}} | |
6043 macros. This is especially important in the case of lists. Use | |
6044 @code{XCAR} and @code{XCDR} if a Lisp_Object is certainly a cons cell, | |
6045 else use @code{Fcar()} and @code{Fcdr()}. Trust other C code, but not | |
6046 Lisp code. On the other hand, if XEmacs has an internal logic error, | |
6047 it's better to crash immediately, so sprinkle @code{assert()}s and | |
6048 ``unreachable'' @code{abort()}s liberally about the source code. Where | |
6049 performance is an issue, use @code{type_checking_assert}, | |
6050 @code{bufpos_checking_assert}, and @code{gc_checking_assert}, which do | |
6051 nothing unless the corresponding configure error checking flag was | |
6052 specified. | |
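The split between ``trusting'' and ``checking'' accessors might look roughly like this. It is a sketch under stated assumptions: @code{TOY_ERROR_CHECK_TYPES} is a hypothetical switch standing in for the configure flag, and the toy cons holds a plain @code{int} rather than a Lisp_Object.

```c
#include <assert.h>

/* Hypothetical build-time switch standing in for the configure flag;
   the real flag and macro definitions live in the XEmacs headers. */
#ifdef TOY_ERROR_CHECK_TYPES
# define toy_type_checking_assert(cond) assert (cond)
#else
# define toy_type_checking_assert(cond) ((void) 0)
#endif

struct toy_cons { int car; struct toy_cons *cdr; };

/* Fast accessor: the caller guarantees CELL is a real cons. */
static int
toy_xcar (const struct toy_cons *cell)
{
  toy_type_checking_assert (cell != 0);
  return cell->car;
}

/* Safe accessor for untrusted input: a null cell (playing the role
   of nil) yields 0 here instead of being dereferenced. */
static int
toy_fcar (const struct toy_cons *cell)
{
  return cell ? toy_xcar (cell) : 0;
}
```

The checking assert costs nothing in a normal build, so it can be left in performance-sensitive accessors.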
6053 | |
6054 @node Allocation of Objects in XEmacs Lisp, The Lisp Reader and Compiler, How Lisp Objects Are Represented in C, Top | |
5784 @chapter Allocation of Objects in XEmacs Lisp | 6055 @chapter Allocation of Objects in XEmacs Lisp |
5785 @cindex allocation of objects in XEmacs Lisp | 6056 @cindex allocation of objects in XEmacs Lisp |
5786 @cindex objects in XEmacs Lisp, allocation of | 6057 @cindex objects in XEmacs Lisp, allocation of |
5787 @cindex Lisp objects, allocation of in XEmacs | 6058 @cindex Lisp objects, allocation of in XEmacs |
5788 | 6059 |
7058 @cindex function, compiled | 7329 @cindex function, compiled |
7059 | 7330 |
7060 Not yet documented. | 7331 Not yet documented. |
7061 | 7332 |
7062 | 7333 |
7063 @node Dumping, Events and the Event Loop, Allocation of Objects in XEmacs Lisp, Top | 7334 @node The Lisp Reader and Compiler, Evaluation; Stack Frames; Bindings, Allocation of Objects in XEmacs Lisp, Top |
7064 @chapter Dumping | 7335 @chapter The Lisp Reader and Compiler |
7065 @cindex dumping | 7336 @cindex Lisp reader and compiler, the |
7337 @cindex reader and compiler, the Lisp | |
7338 @cindex compiler, the Lisp reader and | |
7339 | |
7340 Not yet documented. | |
7341 | |
7342 @node Evaluation; Stack Frames; Bindings, Symbols and Variables, The Lisp Reader and Compiler, Top | |
7343 @chapter Evaluation; Stack Frames; Bindings | |
7344 @cindex evaluation; stack frames; bindings | |
7345 @cindex stack frames; bindings, evaluation; | |
7346 @cindex bindings, evaluation; stack frames; | |
7066 | 7347 |
7067 @menu | 7348 @menu |
7068 * Dumping Justification:: | 7349 * Evaluation:: |
7069 * Overview:: | 7350 * Dynamic Binding; The specbinding Stack; Unwind-Protects:: |
7070 * Data descriptions:: | 7351 * Simple Special Forms:: |
7071 * Dumping phase:: | 7352 * Catch and Throw:: |
7072 * Reloading phase:: | 7353 * Error Trapping:: |
7073 * Remaining issues:: | |
7074 @end menu | 7354 @end menu |
7075 | 7355 |
7076 @node Dumping Justification, Overview, Dumping, Dumping | 7356 @node Evaluation, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings, Evaluation; Stack Frames; Bindings |
7077 @section Dumping Justification | 7357 @section Evaluation |
7078 @cindex dumping, justification | 7358 @cindex evaluation |
7079 | 7359 |
7080 The C code of XEmacs is just a Lisp engine with a lot of built-in | 7360 @code{Feval()} evaluates the form (a Lisp object) that is passed to |
7081 primitives useful for writing an editor. The editor itself is written | 7361 it. Note that evaluation is only non-trivial for two types of objects: |
7082 mostly in Lisp, and represents around 100K lines of code. Loading and | 7362 symbols and conses. A symbol is evaluated simply by calling |
7083 executing the initialization of all this code takes a bit a time (five | 7363 @code{symbol-value} on it and returning the value. |
7084 to ten times the usual startup time of current xemacs) and requires | 7364 |
7085 having all the lisp source files around. Having to reload them each | 7365 Evaluating a cons means calling a function. First, @code{eval} checks |
7086 time the editor is started would not be acceptable. | 7366 to see if garbage-collection is necessary, and calls |
7087 | 7367 @code{garbage_collect_1()} if so. It then increases the evaluation |
7088 The traditional solution to this problem is called dumping: the build | 7368 depth by 1 (@code{lisp_eval_depth}, which is always less than |
7089 process first creates the lisp engine under the name @file{temacs}, then | 7369 @code{max_lisp_eval_depth}) and adds an element to the linked list of |
7090 runs it until it has finished loading and initializing all the lisp | 7370 @code{struct backtrace}'s (@code{backtrace_list}). Each such structure |
7091 code, and eventually creates a new executable called @file{xemacs} | 7371 contains a pointer to the function being called plus a list of the |
7092 including both the object code in @file{temacs} and all the contents of | 7372 function's arguments. Originally these values are stored unevalled, and |
7093 the memory after the initialization. | 7373 as they are evaluated, the backtrace structure is updated. Garbage |
7094 | 7374 collection pays attention to the objects pointed to in the backtrace |
7095 This solution, while working, has a huge problem: the creation of the | 7375 structures (garbage collection might happen while a function is being |
7096 new executable from the actual contents of memory is an extremely | 7376 called or while an argument is being evaluated, and there could easily |
7097 system-specific process, quite error-prone, and which interferes with a | 7377 be no other references to the arguments in the argument list; once an |
7098 lot of system libraries (like malloc). It is even getting worse | 7378 argument is evaluated, however, the unevalled version is not needed by |
7099 nowadays with libraries using constructors which are automatically | 7379 eval, and so the backtrace structure is changed). |
7100 called when the program is started (even before @code{main()}) which tend to | 7380 |
7101 crash when they are called multiple times, once before dumping and once | 7381 At this point, the function to be called is determined by looking at |
7102 after (IRIX 6.x @file{libz.so} pulls in some C++ image libraries thru | 7382 the car of the cons (if this is a symbol, its function definition is |
7103 dependencies which have this problem). Writing the dumper is also one | 7383 retrieved and the process repeated). The function should then consist |
7104 of the most difficult parts of porting XEmacs to a new operating system. | 7384 of either a @code{Lisp_Subr} (built-in function written in C), a |
7105 Basically, `dumping' is an operation that is just not officially | 7385 @code{Lisp_Compiled_Function} object, or a cons whose car is one of the |
7106 supported on many operating systems. | 7386 symbols @code{autoload}, @code{macro} or @code{lambda}. |
7107 | 7387 |
7108 The aim of the portable dumper is to solve the same problem as the | 7388 If the function is a @code{Lisp_Subr}, the lisp object points to a |
7109 system-specific dumper, that is to be able to reload quickly, using only | 7389 @code{struct Lisp_Subr} (created by @code{DEFUN()}), which contains a |
7110 a small number of files, the fully initialized lisp part of the editor, | 7390 pointer to the C function, a minimum and maximum number of arguments |
7111 without any system-specific hacks. | 7391 (or possibly the special constants @code{MANY} or @code{UNEVALLED}), a |
7112 | 7392 pointer to the symbol referring to that subr, and a couple of other |
7113 @node Overview, Data descriptions, Dumping Justification, Dumping | 7393 things. If the subr wants its arguments @code{UNEVALLED}, they are |
7114 @section Overview | 7394 passed raw as a list. Otherwise, an array of evaluated arguments is |
7115 @cindex dumping overview | 7395 created and put into the backtrace structure, and either passed whole |
7116 | 7396 (@code{MANY}) or each argument is passed as a C argument. |
7117 The portable dumping system has to: | 7397 |
7118 | 7398 If the function is a @code{Lisp_Compiled_Function}, |
7399 @code{funcall_compiled_function()} is called. If the function is a | |
7400 lambda list, @code{funcall_lambda()} is called. If the function is a | |
7401 macro, [..... fill in] is done. If the function is an autoload, | |
7402 @code{do_autoload()} is called to load the definition and then eval | |
7403 starts over [explain this more]. | |
7404 | |
7405 When @code{Feval()} exits, the evaluation depth is reduced by one, the | |
7406 debugger is called if appropriate, and the current backtrace structure | |
7407 is removed from the list. | |
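The bookkeeping done on entry to and exit from evaluation can be sketched as below. This is a toy model, not the code in @code{Feval()}: the names are hypothetical, the ``function'' is a plain C function on integers, and exceeding the depth limit trips an @code{assert} here, which need not match what XEmacs does in that situation.

```c
#include <assert.h>
#include <stddef.h>

/* One record per call in progress, linked innermost-first. */
struct toy_backtrace
{
  struct toy_backtrace *next;
  const char *function;
};

static struct toy_backtrace *toy_backtrace_list;
static int toy_eval_depth;
enum { TOY_MAX_EVAL_DEPTH = 100 };

typedef int (*toy_fn) (int);

/* Call FN with ARG, maintaining the depth counter and backtrace list. */
static int
toy_call (const char *name, toy_fn fn, int arg)
{
  struct toy_backtrace frame;
  int result;

  assert (toy_eval_depth < TOY_MAX_EVAL_DEPTH);
  toy_eval_depth++;

  /* Push a frame that lives in this C stack frame. */
  frame.next = toy_backtrace_list;
  frame.function = name;
  toy_backtrace_list = &frame;

  result = fn (arg);

  /* Pop the frame and restore the depth on the way out. */
  toy_backtrace_list = frame.next;
  toy_eval_depth--;
  return result;
}

/* Sample callee: report the depth seen from inside a call. */
static int
toy_report_depth (int unused)
{
  (void) unused;
  return toy_eval_depth;
}
```

Keeping each record on the C stack means no allocation is needed per call, at the cost of the list being valid only while the calls are active.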
7408 | |
7409 Both @code{funcall_compiled_function()} and @code{funcall_lambda()} need | |
7410 to go through the list of formal parameters to the function and bind | |
7411 them to the actual arguments, checking for @code{&rest} and | |
7412 @code{&optional} symbols in the formal parameters and making sure the | |
7413 number of actual arguments is correct. | |
7414 @code{funcall_compiled_function()} can do this a little more | |
7415 efficiently, since the formal parameter list can be checked for sanity | |
7416 when the compiled function object is created. | |
7417 | |
7418 @code{funcall_lambda()} simply calls @code{Fprogn} to execute the code | |
7419 in the lambda list. | |
7420 | |
7421 @code{funcall_compiled_function()} calls the real byte-code interpreter | |
7422 @code{execute_optimized_program()} on the byte-code instructions, which | |
7423 are converted into an internal form for faster execution. | |
7424 | |
7425 When a compiled function is executed for the first time by | |
7426 @code{funcall_compiled_function()}, or during the dump phase of building | |
7427 XEmacs, the byte-code instructions are converted from a | |
7428 @code{Lisp_String} (which is inefficient to access, especially in the | |
7429 presence of MULE) into a @code{Lisp_Opaque} object containing an array | |
7430 of unsigned char, which can be directly executed by the byte-code | |
7431 interpreter. At this time the byte code is also analyzed for validity | |
7432 and transformed into a more optimized form, so that | |
7433 @code{execute_optimized_program()} can really fly. | |
7434 | |
7435 Here are some of the optimizations performed by the internal byte-code | |
7436 transformer: | |
7119 @enumerate | 7437 @enumerate |
7120 @item | 7438 @item |
7121 At dump time, write all initialized, non-quickly-rebuildable data to a | 7439 References to the @code{constants} array are checked for out-of-range |
7122 file [Note: currently named @file{xemacs.dmp}, but the name will | 7440 indices, so that the byte interpreter doesn't have to. |
7123 change], along with all information needed for the reloading. | 7441 @item |
7124 | 7442 References to the @code{constants} array that will be used as a Lisp |
7125 @item | 7443 variable are checked for being correct non-constant (i.e. not @code{t}, |
7126 When starting xemacs, reload the dump file, relocate it to its new | 7444 @code{nil}, or @code{keywordp}) symbols, so that the byte interpreter |
7127 starting address if needed, and reinitialize all pointers to this | 7445 doesn't have to. |
7128 data. Also, rebuild all the quickly rebuildable data. | 7446 @item |
7447 The maximum number of variable bindings in the byte-code is | |
7448 pre-computed, so that space on the @code{specpdl} stack can be | |
7449 pre-reserved once for the whole function execution. | |
7450 @item | |
7451 All byte-code jumps are relative to the current program counter instead | |
7452 of the start of the program, thereby saving a register. | |
7453 @item | |
7454 One-byte relative jumps are converted from the byte-code form of unsigned | |
7455 chars offset by 127 to machine-friendly signed chars. | |
7129 @end enumerate | 7456 @end enumerate |
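The last item in the list of optimizations can be illustrated with a short sketch. It assumes that the stored byte is the signed displacement plus 127; the function names are hypothetical and the treatment of the value 255 is a choice made for this example.

```c
#include <assert.h>

/* Hypothetical decoder for a one-byte relative jump operand.  The
   stored form is assumed to be the signed displacement plus 127,
   held in an unsigned char. */
static signed char
toy_decode_relative_jump (unsigned char encoded)
{
  int displacement = (int) encoded - 127;
  /* 255 would decode to 128, which does not fit in a signed char. */
  assert (displacement <= 127);
  return (signed char) displacement;
}

/* Apply a decoded displacement to a program counter index. */
static int
toy_jump (int pc, unsigned char encoded)
{
  return pc + toy_decode_relative_jump (encoded);
}
```

Doing this conversion once, when the function is first prepared, keeps the subtraction out of the interpreter's inner loop.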
7130 | 7457 |
7131 Note: As of 21.5.18, the dump file has been moved inside of the | 7458 Of course, this transformation of the @code{instructions} should not be |
7132 executable, although there are still problems with this on some systems. | 7459 visible to the user, so @code{Fcompiled_function_instructions()} needs |
7133 | 7460 to know how to convert the optimized opaque object back into a Lisp |
7134 @node Data descriptions, Dumping phase, Overview, Dumping | 7461 string that is identical to the original string from the @file{.elc} |
7135 @section Data descriptions | 7462 file. (Actually, the resulting string may (rarely) contain slightly |
7136 @cindex dumping data descriptions | 7463 different, yet equivalent, byte code.) |
7137 | 7464 |
7138 The more complex task of the dumper is to be able to write memory blocks | 7465 @code{Ffuncall()} implements Lisp @code{funcall}. @code{(funcall fun |
7139 on the heap (lisp objects, i.e. lrecords, and C-allocated memory, such | 7466 x1 x2 x3 ...)} is equivalent to @code{(eval (list fun (quote x1) (quote |
7140 as structs and arrays) to disk and reload them at a different address, | 7467 x2) (quote x3) ...))}. @code{Ffuncall()} contains its own code to do |
7141 updating all the pointers they include in the process. This is done by | 7468 the evaluation, however, and is very similar to @code{Feval()}. |
7142 using external data descriptions that give information about the layout | 7469 |
7143 of the blocks in memory. | 7470 From the performance point of view, it is worth knowing that most of the |
7144 | 7471 time in Lisp evaluation is spent executing @code{Lisp_Subr} and |
7145 The specification of these descriptions is in lrecord.h. A description | 7472 @code{Lisp_Compiled_Function} objects via @code{Ffuncall()} (not |
7146 of an lrecord is an array of struct memory_description. Each of these | 7473 @code{Feval()}). |
7147 structs include a type, an offset in the block and some optional | 7474 |
7148 parameters depending on the type. For instance, here is the string | 7475 @code{Fapply()} implements Lisp @code{apply}, which is very similar to |
7149 description: | 7476 @code{funcall} except that if the last argument is a list, the result is the |
7477 same as if each of the arguments in the list had been passed separately. | |
7478 @code{Fapply()} does some business to expand the last argument if it's a | |
7479 list, then calls @code{Ffuncall()} to do the work. | |
7480 | |
7481 @code{apply1()}, @code{call0()}, @code{call1()}, @code{call2()}, and | |
7482 @code{call3()} call a function, passing it the argument(s) given (the | |
7483 arguments are given as separate C arguments rather than being passed as | |
7484 an array). @code{apply1()} uses @code{Fapply()} while the others use | |
7485 @code{Ffuncall()} to do the real work. | |
7486 | |
7487 @node Dynamic Binding; The specbinding Stack; Unwind-Protects, Simple Special Forms, Evaluation, Evaluation; Stack Frames; Bindings | |
7488 @section Dynamic Binding; The specbinding Stack; Unwind-Protects | |
7489 @cindex dynamic binding; the specbinding stack; unwind-protects | |
7490 @cindex binding; the specbinding stack; unwind-protects, dynamic | |
7491 @cindex specbinding stack; unwind-protects, dynamic binding; the | |
7492 @cindex unwind-protects, dynamic binding; the specbinding stack; | |
7150 | 7493 |
7151 @example | 7494 @example |
7152 static const struct memory_description string_description[] = @{ | 7495 struct specbinding |
7153 @{ XD_BYTECOUNT, offsetof (Lisp_String, size) @}, | 7496 @{ |
7154 @{ XD_OPAQUE_DATA_PTR, offsetof (Lisp_String, data), XD_INDIRECT(0, 1) @}, | 7497 Lisp_Object symbol; |
7155 @{ XD_LISP_OBJECT, offsetof (Lisp_String, plist) @}, | 7498 Lisp_Object old_value; |
7156 @{ XD_END @} | 7499 Lisp_Object (*func) (Lisp_Object); /* for unwind-protect */ |
7157 @}; | 7500 @}; |
7158 @end example | 7501 @end example |
7159 | 7502 |
7160 The first line indicates a member of type Bytecount, which is used by | 7503 @code{struct specbinding} is used for local-variable bindings and |
7161 the next, indirect directive. The second means "there is a pointer to | 7504 unwind-protects. @code{specpdl} holds an array of @code{struct specbinding}'s, |
7162 some opaque data in the field @code{data}". The length of said data is | 7505 @code{specpdl_ptr} points to the beginning of the free bindings in the |
7163 given by the expression @code{XD_INDIRECT(0, 1)}, which means "the value | 7506 array, @code{specpdl_size} specifies the total number of binding slots |
7164 in the 0th line of the description (welcome to C) plus one". The third | 7507 in the array, and @code{max_specpdl_size} specifies the maximum number |
7165 line means "there is a Lisp_Object member @code{plist} in the Lisp_String | 7508 of bindings the array can be expanded to hold. @code{grow_specpdl()} |
7166 structure". @code{XD_END} then ends the description. | 7509 increases the size of the @code{specpdl} array, multiplying its size by |
7167 | 7510 2 but never exceeding @code{max_specpdl_size} (except that if this |
7168 This gives us all the information we need to move around what is pointed | 7511 number is less than 400, it is first set to 400). |
7169 to by a memory block (C or lrecord) and, by transitivity, everything | 7512 |
7170 that it points to. The only missing information for dumping is the size | 7513 @code{specbind()} binds a symbol to a value and is used for local |
7171 of the block. For lrecords, this is part of the | 7514 variables and @code{let} forms. The symbol and its old value (which |
7172 lrecord_implementation, so we don't need to duplicate it. For C blocks | 7515 might be @code{Qunbound}, indicating no prior value) are recorded in the |
7173 we use a struct sized_memory_description, which includes a size field | 7516 specpdl array, and @code{specpdl_ptr} is advanced by one slot. |
7174 and a pointer to an associated array of memory_description. | 7517 |
7175 | 7518 @code{record_unwind_protect()} implements an @dfn{unwind-protect}, |
7176 @node Dumping phase, Reloading phase, Data descriptions, Dumping | 7519 which, when placed around a section of code, ensures that some specified |
7177 @section Dumping phase | 7520 cleanup routine will be executed even if the code exits abnormally |
7178 @cindex dumping phase | 7521 (e.g. through a @code{throw} or quit). @code{record_unwind_protect()} |
7179 | 7522 simply adds a new specbinding to the @code{specpdl} array and stores the |
7180 Dumping is done by calling the function @code{pdump()} (in @file{dumper.c}) which is | 7523 appropriate information in it. The cleanup routine can either be a C |
7181 invoked from Fdump_emacs (in @file{emacs.c}). This function performs a number | 7524 function, which is stored in the @code{func} field, or a @code{progn} |
7182 of tasks. | 7525 form, which is stored in the @code{old_value} field. |
7526 | |
7527 @code{unbind_to()} removes specbindings from the @code{specpdl} array | |
7528 until the specified position is reached. Each specbinding can be one of | |
7529 three types: | |
7530 | |
7531 @enumerate | |
7532 @item | |
7533 an unwind-protect with a C cleanup function (@code{func} is not 0, and | |
7534 @code{old_value} holds an argument to be passed to the function); | |
7535 @item | |
7536 an unwind-protect with a Lisp form (@code{func} is 0, @code{symbol} | |
7537 is @code{nil}, and @code{old_value} holds the form to be executed with | |
7538 @code{Fprogn()}); or | |
7539 @item | |
7540 a local-variable binding (@code{func} is 0, @code{symbol} is not | |
7541 @code{nil}, and @code{old_value} holds the old value, which is stored as | |
7542 the symbol's value). | |
7543 @end enumerate | |
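The interplay of binding, unwind-protect, and unbinding can be sketched with a toy stack. This is a simplified model with hypothetical names: values are plain @code{int}s, the array has a fixed size instead of growing, and the Lisp-form kind of unwind-protect (the second case above) is left out.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_SPECPDL_SIZE = 16 };

struct toy_specbinding
{
  int *symbol;          /* variable being rebound, or NULL */
  int old_value;        /* saved value, or the argument for FUNC */
  void (*func) (int);   /* C cleanup function, or NULL */
};

static struct toy_specbinding toy_specpdl[TOY_SPECPDL_SIZE];
static int toy_specpdl_depth;   /* index of the first free slot */

/* Save the current value of *SYMBOL and install VALUE. */
static void
toy_specbind (int *symbol, int value)
{
  struct toy_specbinding *slot;
  assert (toy_specpdl_depth < TOY_SPECPDL_SIZE);
  slot = &toy_specpdl[toy_specpdl_depth++];
  slot->symbol = symbol;
  slot->old_value = *symbol;
  slot->func = NULL;
  *symbol = value;
}

/* Register FUNC to be called with ARG when the stack is unwound. */
static void
toy_record_unwind_protect (void (*func) (int), int arg)
{
  struct toy_specbinding *slot;
  assert (toy_specpdl_depth < TOY_SPECPDL_SIZE);
  slot = &toy_specpdl[toy_specpdl_depth++];
  slot->symbol = NULL;
  slot->old_value = arg;
  slot->func = func;
}

/* Pop entries, newest first, until the stack is back at depth COUNT. */
static void
toy_unbind_to (int count)
{
  assert (count >= 0 && count <= toy_specpdl_depth);
  while (toy_specpdl_depth > count)
    {
      struct toy_specbinding *slot = &toy_specpdl[--toy_specpdl_depth];
      if (slot->func)
        slot->func (slot->old_value);       /* run the cleanup */
      else if (slot->symbol)
        *slot->symbol = slot->old_value;    /* restore the variable */
    }
}

/* Sample variable and cleanup function for trying the sketch out. */
static int toy_variable = 1;
static int toy_cleanup_total;

static void
toy_cleanup (int arg)
{
  toy_cleanup_total += arg;
}
```

A caller records the depth before making bindings and passes that depth to @code{toy_unbind_to} afterwards, so everything pushed in between is undone in reverse order.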
7544 | |
7545 @node Simple Special Forms, Catch and Throw, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings | |
7546 @section Simple Special Forms | |
7547 @cindex special forms, simple | |
7548 | |
7549 @code{or}, @code{and}, @code{if}, @code{cond}, @code{progn}, | |
7550 @code{prog1}, @code{prog2}, @code{setq}, @code{quote}, @code{function}, | |
7551 @code{let*}, @code{let}, @code{while} | |
7552 | |
7553 All of these are very simple and work as expected, calling | |
7554 @code{Feval()} or @code{Fprogn()} as necessary and (in the case of | |
7555 @code{let} and @code{let*}) using @code{specbind()} to create bindings | |
7556 and @code{unbind_to()} to undo the bindings when finished. | |
7557 | |
7558 Note that, with the exception of @code{Fprogn}, these functions are | |
7559 typically called in real life only in interpreted code, since the byte | |
7560 compiler knows how to convert calls to these functions directly into | |
7561 byte code. | |
7562 | |
7563 @node Catch and Throw, Error Trapping, Simple Special Forms, Evaluation; Stack Frames; Bindings | |
7564 @section Catch and Throw | |
7565 @cindex catch and throw | |
7566 @cindex throw, catch and | |
7567 | |
7568 @example | |
7569 struct catchtag | |
7570 @{ | |
7571 Lisp_Object tag; | |
7572 Lisp_Object val; | |
7573 struct catchtag *next; | |
7574 struct gcpro *gcpro; | |
7575 jmp_buf jmp; | |
7576 struct backtrace *backlist; | |
7577 int lisp_eval_depth; | |
7578 int pdlcount; | |
7579 @}; | |
7580 @end example | |
7581 | |
7582 @code{catch} is a Lisp function that places a catch around a body of | |
7583 code. A catch is a means of non-local exit from the code. When a catch | |
7584 is created, a tag is specified, and executing a @code{throw} to this tag | |
7585 will exit from the body of code caught with this tag, and its value will | |
7586 be the value given in the call to @code{throw}. If there is no such | |
7587 call, the code will be executed normally. | |
7588 | |
7589 Information pertaining to a catch is held in a @code{struct catchtag}, | |
7590 which is placed at the head of a linked list pointed to by | |
7591 @code{catchlist}. @code{internal_catch()} is passed a C function to | |
7592 call (@code{Fprogn()} when Lisp @code{catch} is called) and arguments to | |
7593 give it, and places a catch around the function. Each @code{struct | |
7594 catchtag} is held in the stack frame of the @code{internal_catch()} | |
7595 instance that created the catch. | |
7596 | |
7597 @code{internal_catch()} is fairly straightforward. It stores into the | |
7598 @code{struct catchtag} the tag name and the current values of | |
7599 @code{backtrace_list}, @code{lisp_eval_depth}, @code{gcprolist}, and the | |
7600 offset into the @code{specpdl} array, sets a jump point with @code{_setjmp()} | |
7601 (storing the jump point into the @code{struct catchtag}), and calls the | |
7602 function. Control will return to @code{internal_catch()} either when | |
7603 the function exits normally or through a @code{_longjmp()} to this jump | |
7604 point. In the latter case, @code{throw} will store the value to be | |
7605 returned into the @code{struct catchtag} before jumping. When it's | |
7606 done, @code{internal_catch()} removes the @code{struct catchtag} from | |
7607 the catchlist and returns the proper value. | |
7608 | |
7609 @code{Fthrow()} goes up through the catchlist until it finds one with | |
7610 a matching tag. It then calls @code{unbind_catch()} to restore | |
7611 everything to what it was when the appropriate catch was set, stores the | |
7612 return value in the @code{struct catchtag}, and jumps (with | |
7613 @code{_longjmp()}) to its jump point. | |
7614 | |
7615 @code{unbind_catch()} removes all catches from the catchlist until it | |
7616 finds the correct one. Some of the catches might have been placed for | |
7617 error-trapping, and if so, the appropriate entries on the handlerlist | |
7618 must be removed (see ``errors''). @code{unbind_catch()} also restores | |
7619 the values of @code{gcprolist}, @code{backtrace_list}, and | |
7620 @code{lisp_eval}, and calls @code{unbind_to()} to undo any specbindings | |
7621 created since the catch. | |
7622 | |
7623 @node Error Trapping, , Catch and Throw, Evaluation; Stack Frames; Bindings | |
7624 @section Error Trapping | |
7625 @cindex error trapping | |
7626 | |
7627 @subheading call_trapping_problems(): | |
7628 | |
7629 This is equivalent to (*fun) (arg), except that various conditions | |
7630 can be trapped or inhibited, according to FLAGS. | |
7631 | |
7632 @itemize @bullet | |
7633 @item | |
7634 If FLAGS does not contain NO_INHIBIT_ERRORS, when an error occurs, | |
7635 the error is caught and a warning is issued, specifying the | |
7636 specific error that occurred and a backtrace. In that case, | |
7637 WARNING_STRING should be given, and will be printed at the | |
7638 beginning of the error to indicate where the error occurred. | |
7639 | |
7640 @item | |
7641 If FLAGS does not contain NO_INHIBIT_THROWS, all attempts to | |
7642 @code{throw} out of the function being called are trapped, and a warning | |
7643 issued. (Again, WARNING_STRING should be given.) | |
7644 | |
7645 @item | |
7646 If FLAGS contains INHIBIT_WARNING_ISSUE, no warnings are issued; | |
7647 this applies to recursive invocations of call_trapping_problems, too. | |
7648 | |
7649 @item | |
7650 If FLAGS contains POSTPONE_WARNING_ISSUE, no warnings are issued; | |
7651 but values useful for generating a warning are still computed (in | |
7652 particular, the backtrace), so that the calling function can issue | |
7653 a warning. | |
7654 | |
7655 @item | |
7656 If FLAGS contains ISSUE_WARNINGS_AT_DEBUG_LEVEL, warnings will be | |
7657 issued, but at level @code{debug}, which normally is below the minimum | |
7658 specified by @code{log-warning-minimum-level}, meaning such warnings will | |
7659 be ignored entirely. (The user can change this variable, however, | |
7660 to see the warnings.) | |
7661 | |
7662 Note: If neither NO_INHIBIT_THROWS nor NO_INHIBIT_ERRORS is | |
7663 given, you are @strong{guaranteed} that there will be no non-local exits | |
7664 out of this function. | |
7665 | |
7666 @item | |
7667 If FLAGS contains INHIBIT_QUIT, QUIT using C-g is inhibited. (This | |
7668 is @strong{rarely} a good idea.) Unless you use NO_INHIBIT_ERRORS, QUIT is | |
7669 automatically caught as well, and treated as an error; you can | |
7670 check for this using EQ (problems->error_conditions, Qquit). | |
7671 | |
7672 @item | |
7673 If FLAGS contains UNINHIBIT_QUIT, QUIT checking will be explicitly | |
7674 turned on. (It will abort the code being called, but will still be | |
7675 trapped and reported as an error, unless NO_INHIBIT_ERRORS is | |
7676 given.) This is useful when QUIT checking has been turned off by a | |
7677 higher-level caller. | |
7678 | |
7679 @item | |
7680 If FLAGS contains INHIBIT_GC, garbage collection is inhibited. | |
7681 This is useful for Lisp called within redisplay, for example. | |
7682 | |
7683 @item | |
7684 If FLAGS contains INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION, | |
7685 Lisp code is not allowed to delete any windows, buffers, frames, devices, | |
7686 or consoles that were already in existence at the time this function | |
7687 was called. (However, it's perfectly legal for code to create a new | |
7688 buffer and then delete it.) | |
7689 | |
7690 #### It might be useful to have a flag that inhibits deletion of a | |
7691 specific permanent display object and everything it's attached to | |
7692 (e.g. a window, and the buffer, frame, device, and console it's | |
7693 attached to). | |
7694 | |
7695 @item | |
7696 If FLAGS contains INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION, Lisp | |
7697 code is not allowed to modify the text of any buffers that were | |
7698 already in existence at the time this function was called. | |
7699 (However, it's perfectly legal for code to create a new buffer and | |
7700 then modify its text.) | |
7701 | |
7702 @quotation | |
7703 [These last two flags are implemented using global variables | |
7704 Vdeletable_permanent_display_objects and Vmodifiable_buffers, | |
7705 which keep track of a list of all buffers or permanent display | |
7706 objects created since the last time one of these flags was set. | |
7707 The code that deletes buffers, etc. and modifies buffers checks | |
7708 | |
7709 @enumerate | |
7710 @item | |
7711 if the corresponding flag is set (through the global variable | |
7712 inhibit_flags or its accessor function get_inhibit_flags()), and | |
7713 | |
7714 @item | |
7715 if the object to be modified or deleted is not in the | |
7716 appropriate list. | |
7717 @end enumerate | |
7718 | |
7719 If so, it signals an error. | |
7720 | |
7721 Recursive calls to call_trapping_problems() are allowed. In | |
7722 the case of the two flags mentioned above, the current values | |
7723 of the global variables are stored in an unwind-protect, and | |
7724 they're reset to nil.] | |
7725 @end quotation | |
7726 | |
7727 @item | |
7728 If FLAGS contains INHIBIT_ENTERING_DEBUGGER, the debugger will not | |
7729 be entered if an error occurs inside the Lisp code being called, | |
7730 even when the user has requested that the debugger be entered on | |
7731 error. In such a case, a warning is issued stating that access to the debugger is denied, unless | |
7732 INHIBIT_WARNING_ISSUE has also been supplied. This is useful when | |
7733 calling Lisp code inside redisplay, in menu callbacks, etc. because | |
7734 in such cases either the display is in an inconsistent state or | |
7735 doing window operations is explicitly forbidden by the OS, and the | |
7736 debugger would cause visual changes on the screen and might create | |
7737 another frame. | |
7738 | |
7739 @item | |
7740 If FLAGS contains INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY, no | |
7741 changes of any sort to extents, faces, glyphs, buffer text, | |
7742 specifiers relating to display, other variables relating to | |
7743 display, splitting, deleting, or resizing windows or frames, | |
7744 deleting buffers, windows, frames, devices, or consoles, etc. are | |
7745 allowed. This is for things called absolutely in the middle of | |
7746 redisplay, which expects things to be @strong{exactly} the same after the | |
7747 call as before. This isn't completely implemented and needs to be | |
7748 thought out some more to determine exactly what its semantics are. | |
7749 For the moment, turning on this flag also turns on | |
7750 | |
7751 @itemize @minus | |
7752 @item | |
7753 INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION | |
7754 @item | |
7755 INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION | |
7756 @item | |
7757 INHIBIT_ENTERING_DEBUGGER | |
7758 @item | |
7759 INHIBIT_WARNING_ISSUE | |
7760 @item | |
7761 INHIBIT_GC | |
7762 @end itemize | |
7763 | |
7764 @item | |
7765 #### The following five flags are defined, but unimplemented: | |
7766 | |
7767 #define INHIBIT_EXISTING_CODING_SYSTEM_DELETION (1<<6) | |
7768 #define INHIBIT_EXISTING_CHARSET_DELETION (1<<7) | |
7769 #define INHIBIT_PERMANENT_DISPLAY_OBJECT_CREATION (1<<8) | |
7770 #define INHIBIT_CODING_SYSTEM_CREATION (1<<9) | |
7771 #define INHIBIT_CHARSET_CREATION (1<<10) | |
7772 | |
7773 @item | |
7774 FLAGS containing CALL_WITH_SUSPENDED_ERRORS is a sign that | |
7775 call_with_suspended_errors() was invoked. This exists only for | |
7776 debugging purposes -- often we want to break when a signal happens, | |
7777 but ignore signals from call_with_suspended_errors(), because they | |
7778 occur often and for legitimate reasons. | |
7779 @end itemize | |
7780 | |
7781 If PROBLEM is non-zero, it should be a pointer to a structure into | |
7782 which exact information about any occurring problems (either an | |
7783 error or an attempted throw past this boundary) will be stored. | |
7784 | |
7785 If a problem occurred and aborted operation (error, quit, or | |
7786 invalid throw), Qunbound is returned. Otherwise the return value | |
7787 from the call to (*fun) (arg) is returned. | |
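
The way FLAGS combines can be illustrated with a small, self-contained model. This is only a sketch: the flag names are taken from the text above, but the @code{SK_} prefix, the bit values, and the helper functions @code{sketch_expand_flags()} and @code{sketch_no_nonlocal_exit()} are assumptions made for illustration and are not the definitions used in the XEmacs sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values.  Only the names (minus the SK_ prefix) come
   from the documentation; the numbers are placeholders for this sketch. */
enum sketch_flags {
  SK_NO_INHIBIT_ERRORS                                  = 1 << 0,
  SK_NO_INHIBIT_THROWS                                  = 1 << 1,
  SK_INHIBIT_WARNING_ISSUE                              = 1 << 2,
  SK_INHIBIT_GC                                         = 1 << 3,
  SK_INHIBIT_ENTERING_DEBUGGER                          = 1 << 4,
  SK_INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION = 1 << 5,
  SK_INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION          = 1 << 6,
  SK_INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY             = 1 << 7
};

/* Model of "turning on this flag also turns on ...": the redisplay flag
   implies the five flags listed under it in the text. */
int sketch_expand_flags(int flags)
{
  const int implied =
    SK_INHIBIT_EXISTING_PERMANENT_DISPLAY_OBJECT_DELETION |
    SK_INHIBIT_EXISTING_BUFFER_TEXT_MODIFICATION |
    SK_INHIBIT_ENTERING_DEBUGGER |
    SK_INHIBIT_WARNING_ISSUE |
    SK_INHIBIT_GC;

  if (flags & SK_INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY)
    flags |= implied;
  return flags;
}

/* Model of the guarantee in the note: if neither NO_INHIBIT_ERRORS nor
   NO_INHIBIT_THROWS is given, there are no non-local exits. */
bool sketch_no_nonlocal_exit(int flags)
{
  return (flags & (SK_NO_INHIBIT_ERRORS | SK_NO_INHIBIT_THROWS)) == 0;
}

/* Small self-check showing how the two helpers are meant to be used. */
void sketch_flags_demo(void)
{
  int f = sketch_expand_flags(SK_INHIBIT_ANY_CHANGE_AFFECTING_REDISPLAY);
  assert(f & SK_INHIBIT_GC);
  assert(sketch_no_nonlocal_exit(f));
  assert(!sketch_no_nonlocal_exit(f | SK_NO_INHIBIT_ERRORS));
}
```

A caller that sees Qunbound returned would, in the real code, look at the structure passed as PROBLEM to find out what happened; that part is not modeled here.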
7788 | |
7789 @node Symbols and Variables, Buffers, Evaluation; Stack Frames; Bindings, Top | |
7790 @chapter Symbols and Variables | |
7791 @cindex symbols and variables | |
7792 @cindex variables, symbols and | |
7183 | 7793 |
7184 @menu | 7794 @menu |
7185 * Object inventory:: | 7795 * Introduction to Symbols:: |
7186 * Address allocation:: | 7796 * Obarrays:: |
7187 * The header:: | 7797 * Symbol Values:: |
7188 * Data dumping:: | |
7189 * Pointers dumping:: | |
7190 @end menu | 7798 @end menu |
7191 | 7799 |
7192 @node Object inventory, Address allocation, Dumping phase, Dumping phase | 7800 @node Introduction to Symbols, Obarrays, Symbols and Variables, Symbols and Variables |
7193 @subsection Object inventory | 7801 @section Introduction to Symbols |
7194 @cindex dumping object inventory | 7802 @cindex symbols, introduction to |
7195 @cindex memory blocks | 7803 |
7196 | 7804 A symbol is basically just an object with four fields: a name (a |
7197 The first task is to build the list of the objects to dump. This | 7805 string), a value (some Lisp object), a function (some Lisp object), and |
7198 includes: | 7806 a property list (usually a list of alternating keyword/value pairs). |
7807 What makes symbols special is that there is usually only one symbol with | |
7808 a given name, and the symbol is referred to by name. This makes a | |
7809 symbol a convenient way of calling up data by name, i.e. of implementing | |
7810 variables. (The variable's value is stored in the @dfn{value slot}.) | |
7811 Similarly, functions are referenced by name, and the definition of the | |
7812 function is stored in a symbol's @dfn{function slot}. This means that | |
7813 there can be a distinct function and variable with the same name. The | |
7814 property list is used as a more general mechanism of associating | |
7815 additional values with particular names, and once again the namespace is | |
7816 independent of the function and variable namespaces. | |
7817 | |
7818 @node Obarrays, Symbol Values, Introduction to Symbols, Symbols and Variables | |
7819 @section Obarrays | |
7820 @cindex obarrays | |
7821 | |
7822 The identity of symbols with their names is accomplished through a | |
7823 structure called an obarray, which is just a poorly-implemented hash | |
7824 table mapping from strings to symbols whose name is that string. (I say | |
7825 ``poorly implemented'' because an obarray appears in Lisp as a vector | |
7826 with some hidden fields rather than as its own opaque type. This is an | |
7827 Emacs Lisp artifact that should be fixed.) | |
7828 | |
7829 Obarrays are implemented as a vector of some fixed size (which should | |
7830 be a prime for best results), where each ``bucket'' of the vector | |
7831 contains one or more symbols, threaded through a hidden @code{next} | |
7832 field in the symbol. Lookup of a symbol in an obarray, and adding a | |
7833 symbol to an obarray, is accomplished through standard hash-table | |
7834 techniques. | |
7835 | |
7836 The standard Lisp function for working with symbols and obarrays is | |
7837 @code{intern}. This looks up a symbol in an obarray given its name; if | |
7838 it's not found, a new symbol is automatically created with the specified | |
7839 name, added to the obarray, and returned. This is what happens when the | |
7840 Lisp reader encounters a symbol (or more precisely, encounters the name | |
7841 of a symbol) in some text that it is reading. There is a standard | |
7842 obarray called @code{obarray} that is used for this purpose, although | |
7843 the Lisp programmer is free to create his own obarrays and @code{intern} | |
7844 symbols in them. | |
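
The bucket-and-chain layout and the behavior of @code{intern} described above can be sketched in plain C. This is a simplified model under stated assumptions, not the XEmacs implementation: the type and function names (@code{sketch_symbol}, @code{sketch_lookup()}, @code{sketch_intern()}), the table size, and the hash function are all invented for illustration, and a symbol here carries only a name and the hidden @code{next} link.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_OBARRAY_SIZE 509   /* a prime, as the text recommends */

struct sketch_symbol {
  char *name;
  struct sketch_symbol *next;     /* hidden link to the next symbol in the bucket */
};

struct sketch_obarray {
  struct sketch_symbol *buckets[SKETCH_OBARRAY_SIZE];
};

/* Stand-in string hash (multiply by 33 and add); the real hash may differ. */
static unsigned long sketch_hash(const char *s)
{
  unsigned long h = 5381;
  while (*s)
    h = h * 33 + (unsigned char) *s++;
  return h;
}

/* Look a name up without creating anything; returns NULL if absent. */
struct sketch_symbol *sketch_lookup(struct sketch_obarray *ob, const char *name)
{
  struct sketch_symbol *sym = ob->buckets[sketch_hash(name) % SKETCH_OBARRAY_SIZE];
  for (; sym != NULL; sym = sym->next)
    if (strcmp(sym->name, name) == 0)
      return sym;
  return NULL;
}

/* Return the existing symbol with this name, or create one, add it to
   the front of its bucket chain, and return it. */
struct sketch_symbol *sketch_intern(struct sketch_obarray *ob, const char *name)
{
  struct sketch_symbol *sym = sketch_lookup(ob, name);
  if (sym != NULL)
    return sym;

  size_t len = strlen(name) + 1;
  unsigned long i = sketch_hash(name) % SKETCH_OBARRAY_SIZE;

  sym = malloc(sizeof *sym);
  assert(sym != NULL);
  sym->name = malloc(len);
  assert(sym->name != NULL);
  memcpy(sym->name, name, len);

  sym->next = ob->buckets[i];
  ob->buckets[i] = sym;
  return sym;
}
```

Interning the same name twice in one obarray yields the same pointer, which is the property that lets symbols be compared by identity rather than by name.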
7845 | |
7846 Note that, once a symbol is in an obarray, it stays there until | |
7847 something is done about it, and the standard obarray @code{obarray} | |
7848 always stays around, so once you use any particular variable name, a | |
7849 corresponding symbol will stay around in @code{obarray} until you exit | |
7850 XEmacs. | |
7851 | |
7852 Note that @code{obarray} itself is a variable, and as such there is a | |
7853 symbol in @code{obarray} whose name is @code{"obarray"} and which | |
7854 contains @code{obarray} as its value. | |
7855 | |
7856 Note also that this call to @code{intern} occurs only when in the Lisp | |
7857 reader, not when the code is executed (at which point the symbol is | |
7858 already around, stored as such in the definition of the function). | |
7859 | |
7860 You can create your own obarray using @code{make-vector} (this is | |
7861 horrible but is an artifact) and intern symbols into that obarray. | |
7862 Doing that will result in two or more symbols with the same name. | |
7863 However, at most one of these symbols is in the standard @code{obarray}: | |
7864 You cannot have two symbols of the same name in any particular obarray. | |
7865 Note that you cannot add a symbol to an obarray in any fashion other | |
7866 than using @code{intern}: i.e. you can't take an existing symbol and put | |
7867 it in an existing obarray. Nor can you change the name of an existing | |
7868 symbol. (Since obarrays are vectors, you can violate the consistency of | |
7869 things by storing directly into the vector, but let's ignore that | |
7870 possibility.) | |
7871 | |
7872 Usually symbols are created by @code{intern}, but if you really want, | |
7873 you can explicitly create a symbol using @code{make-symbol}, giving it | |
7874 some name. The resulting symbol is not in any obarray (i.e. it is | |
7875 @dfn{uninterned}), and you can't add it to any obarray. Therefore its | |
7876 primary purpose is as a symbol to use in macros to avoid namespace | |
7877 pollution. It can also be used as a carrier of information, but cons | |
7878 cells could probably be used just as well. | |
7879 | |
7880 You can also use @code{intern-soft} to look up a symbol but not create | |
7881 a new one, and @code{unintern} to remove a symbol from an obarray. This | |
7882 returns the removed symbol. (Remember: You can't put the symbol back | |
7883 into any obarray.) Finally, @code{mapatoms} maps over all of the symbols | |
7884 in an obarray. | |
7885 | |
7886 @node Symbol Values, , Obarrays, Symbols and Variables | |
7887 @section Symbol Values | |
7888 @cindex symbol values | |
7889 @cindex values, symbol | |
7890 | |
7891 The value field of a symbol normally contains a Lisp object. However, | |
7892 a symbol can be @dfn{unbound}, meaning that it logically has no value. | |
7893 This is internally indicated by storing a special Lisp object, called | |
7894 @dfn{the unbound marker} and stored in the global variable | |
7895 @code{Qunbound}. The unbound marker is of a special Lisp object type | |
7896 called @dfn{symbol-value-magic}. It is impossible for the Lisp | |
7897 programmer to directly create or access any object of this type. | |
7898 | |
7899 @strong{You must not let any ``symbol-value-magic'' object escape to | |
7900 the Lisp level.} Printing any of these objects will cause the message | |
7901 @samp{INTERNAL EMACS BUG} to appear as part of the print representation. | |
7902 (You may see this normally when you call @code{debug_print()} from the | |
7903 debugger on a Lisp object.) If you let one of these objects escape to | |
7904 the Lisp level, you will violate a number of assumptions contained in | |
7905 the C code and make the unbound marker not function right. | |
7906 | |
7907 When a symbol is created, its value field (and function field) are set | |
7908 to @code{Qunbound}. The Lisp programmer can restore these conditions | |
7909 later using @code{makunbound} or @code{fmakunbound}, and can query to | |
7910 see whether the value or function fields are @dfn{bound} (i.e. have a | |
7911 value other than @code{Qunbound}) using @code{boundp} and | |
7912 @code{fboundp}. The fields are set to a normal Lisp object using | |
7913 @code{set} (or @code{setq}) and @code{fset}. | |
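
The idea of an unbound marker can be modeled with a sentinel value that is distinct from every ordinary object. The following is a minimal sketch under that assumption; the names (@code{sketch_obj}, @code{SKETCH_UNBOUND}, @code{sketch_boundp()}, and so on) are hypothetical, and the real @code{Qunbound} is a Lisp object of its own type rather than a bare address.

```c
#include <assert.h>
#include <stdbool.h>

typedef const void *sketch_obj;   /* stand-in for a Lisp object reference */

/* The address of this private variable serves as the unbound marker.
   No ordinary object can have the same address. */
static const char sketch_unbound_marker = 0;
#define SKETCH_UNBOUND ((sketch_obj) &sketch_unbound_marker)

struct sketch_sym {
  sketch_obj value;
  sketch_obj function;
};

/* A newly created symbol has both fields unbound. */
void sketch_sym_init(struct sketch_sym *s)
{
  s->value = SKETCH_UNBOUND;
  s->function = SKETCH_UNBOUND;
}

bool sketch_boundp(const struct sketch_sym *s)
{
  return s->value != SKETCH_UNBOUND;
}

bool sketch_fboundp(const struct sketch_sym *s)
{
  return s->function != SKETCH_UNBOUND;
}

/* Setting a value must never store the marker itself; that is the
   "do not let it escape" rule turned into an assertion. */
void sketch_set(struct sketch_sym *s, sketch_obj v)
{
  assert(v != SKETCH_UNBOUND);
  s->value = v;
}

void sketch_makunbound(struct sketch_sym *s)
{
  s->value = SKETCH_UNBOUND;
}
```

Because the marker is only ever compared, never handed out, code outside these helpers has no way to obtain it, which mirrors the rule that the Lisp programmer cannot create or access such an object.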
7914 | |
7915 Other symbol-value-magic objects are used as special markers to | |
7916 indicate variables that have non-normal properties. This includes any | |
7917 variables that are tied into C variables (setting the variable magically | |
7918 sets some global variable in the C code, and likewise for retrieving the | |
7919 variable's value), variables that magically tie into slots in the | |
7920 current buffer, variables that are buffer-local, etc. The | |
7921 symbol-value-magic object is stored in the value cell in place of | |
7922 a normal object, and the code to retrieve a symbol's value | |
7923 (i.e. @code{symbol-value}) knows how to do special things with them. | |
7924 This means that you should not just fetch the value cell directly if you | |
7925 want a symbol's value. | |
7926 | |
7927 The exact workings of this are rather complex and involved and are | |
7928 well-documented in comments in @file{buffer.c}, @file{symbols.c}, and | |
7929 @file{lisp.h}. | |
7930 | |
7931 @node Buffers, Text, Symbols and Variables, Top | |
7932 @chapter Buffers | |
7933 @cindex buffers | |
7934 | |
7935 @menu | |
7936 * Introduction to Buffers:: A buffer holds a block of text such as a file. | |
7937 * Buffer Lists:: Keeping track of all buffers. | |
7938 * Markers and Extents:: Tagging locations within a buffer. | |
7939 * The Buffer Object:: The Lisp object corresponding to a buffer. | |
7940 @end menu | |
7941 | |
7942 @node Introduction to Buffers, Buffer Lists, Buffers, Buffers | |
7943 @section Introduction to Buffers | |
7944 @cindex buffers, introduction to | |
7945 | |
7946 A buffer is logically just a Lisp object that holds some text. | |
7947 In this, it is like a string, but a buffer is optimized for | |
7948 frequent insertion and deletion, while a string is not. Furthermore: | |
7949 | |
7950 @enumerate | |
7951 @item | |
7952 Buffers are @dfn{permanent} objects, i.e. once you create them, they | |
7953 remain around, and need to be explicitly deleted before they go away. | |
7954 @item | |
7955 Each buffer has a unique name, which is a string. Buffers are | |
7956 normally referred to by name. In this respect, they are like | |
7957 symbols. | |
7958 @item | |
7959 Buffers have a default insertion position, called @dfn{point}. | |
7960 Inserting text (unless you explicitly give a position) goes at point, | |
7961 and moves point forward past the text. This is what is going on when | |
7962 you type text into Emacs. | |
7963 @item | |
7964 Buffers have lots of extra properties associated with them. | |
7965 @item | |
7966 Buffers can be @dfn{displayed}. What this means is that there | |
7967 exist a number of @dfn{windows}, which are objects that correspond | |
7968 to some visible section of your display, and each window has | |
7969 an associated buffer, and the current contents of the buffer | |
7970 are shown in that section of the display. The redisplay mechanism | |
7971 (which takes care of doing this) knows how to look at the | |
7972 text of a buffer and come up with some reasonable way of displaying | |
7973 this. Many of the properties of a buffer control how the | |
7974 buffer's text is displayed. | |
7975 @item | |
7976 One buffer is distinguished and called the @dfn{current buffer}. It is | |
7977 stored in the variable @code{current_buffer}. Buffer operations operate | |
7978 on this buffer by default. When you are typing text into a buffer, the | |
7979 buffer you are typing into is always @code{current_buffer}. Switching | |
7980 to a different window changes the current buffer. Note that Lisp code | |
7981 can temporarily change the current buffer using @code{set-buffer} (often | |
7982 enclosed in a @code{save-excursion} so that the former current buffer | |
7983 gets restored when the code is finished). However, calling | |
7984 @code{set-buffer} will NOT cause a permanent change in the current | |
7985 buffer. The reason for this is that the top-level event loop sets | |
7986 @code{current_buffer} to the buffer of the selected window, each time | |
7987 it finishes executing a user command. | |
7988 @end enumerate | |
7989 | |
7990 Make sure you understand the distinction between @dfn{current buffer} | |
7991 and @dfn{buffer of the selected window}, and the distinction between | |
7992 @dfn{point} of the current buffer and @dfn{window-point} of the selected | |
7993 window. (This latter distinction is explained in detail in the section | |
7994 on windows.) | |
7995 | |
7996 @node Buffer Lists, Markers and Extents, Introduction to Buffers, Buffers | |
7997 @section Buffer Lists | |
7998 @cindex buffer lists | |
7999 | |
8000 Recall from earlier that buffers are @dfn{permanent} objects, i.e. that | |
8001 they remain around until explicitly deleted. This entails that there is | |
8002 a list of all the buffers in existence. This list is actually an | |
8003 assoc-list (mapping from the buffer's name to the buffer) and is stored | |
8004 in the global variable @code{Vbuffer_alist}. | |
8005 | |
8006 The order of the buffers in the list is important: the buffers are | |
8007 ordered approximately from most-recently-used to least-recently-used. | |
8008 Switching to a buffer using @code{switch-to-buffer}, | |
8009 @code{pop-to-buffer}, etc. and switching windows using | |
8010 @code{other-window}, etc. usually brings the new current buffer to the | |
8011 front of the list. @code{switch-to-buffer}, @code{other-buffer}, | |
8012 etc. look at the beginning of the list to find an alternative buffer to | |
8013 suggest. You can also explicitly move a buffer to the end of the list | |
8014 using @code{bury-buffer}. | |
8015 | |
8016 In addition to the global ordering in @code{Vbuffer_alist}, each frame | |
8017 has its own ordering of the list. These lists always contain the same | |
8018 elements as in @code{Vbuffer_alist} although possibly in a different | |
8019 order. @code{buffer-list} normally returns the list for the selected | |
8020 frame. This allows you to work in separate frames without things | |
8021 interfering with each other. | |
8022 | |
8023 The standard way to look up a buffer given a name is | |
8024 @code{get-buffer}, and the standard way to create a new buffer is | |
8025 @code{get-buffer-create}, which looks up a buffer with a given name, | |
8026 creating a new one if necessary. These operations correspond exactly | |
8027 with the symbol operations @code{intern-soft} and @code{intern}, | |
8028 respectively. You can also force a new buffer to be created using | |
8029 @code{generate-new-buffer}, which takes a name and (if necessary) makes | |
8030 a unique name from this by appending a number, and then creates the | |
8031 buffer. This is basically like the symbol operation @code{gensym}. | |
8032 | |
8033 @node Markers and Extents, The Buffer Object, Buffer Lists, Buffers | |
8034 @section Markers and Extents | |
8035 @cindex markers and extents | |
8036 @cindex extents, markers and | |
8037 | |
8038 Among the things associated with a buffer are things that are | |
8039 logically attached to certain buffer positions. This can be used to | |
8040 keep track of a buffer position when text is inserted and deleted, so | |
8041 that it remains at the same spot relative to the text around it; to | |
8042 assign properties to particular sections of text; etc. There are two | |
8043 such objects that are useful in this regard: they are @dfn{markers} and | |
8044 @dfn{extents}. | |
8045 | |
8046 A @dfn{marker} is simply a flag placed at a particular buffer | |
8047 position, which is moved around as text is inserted and deleted. | |
8048 Markers are used for all sorts of purposes, such as the @code{mark} that | |
8049 is the other end of textual regions to be cut, copied, etc. | |
8050 | |
8051 An @dfn{extent} is similar to two markers plus some associated | |
8052 properties, and is used to keep track of regions in a buffer as text is | |
8053 inserted and deleted, and to add properties (e.g. fonts) to particular | |
8054 regions of text. The external interface of extents is explained | |
8055 elsewhere. | |
8056 | |
8057 The important thing here is that markers and extents simply contain | |
8058 buffer positions in them as integers, and every time text is inserted or | |
8059 deleted, these positions must be updated. In order to minimize the | |
8060 amount of shuffling that needs to be done, the positions in markers and | |
8061 extents (there's one per marker, two per extent) are stored as @code{Membpos} values (memory positions). | |
8062 This means that they only need to be moved when the text is physically | |
8063 moved in memory; since the gap structure tries to minimize this, it also | |
8064 minimizes the number of marker and extent indices that need to be | |
8065 adjusted. Look in @file{insdel.c} for the details of how this works. | |
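
Why storing memory positions saves work can be seen in a small model. The sketch below assumes 0-based byte offsets and a single gap described by its start and size; the structure and function names are hypothetical and do not correspond to the code in @file{insdel.c}.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal description of where the gap sits in the buffer's memory. */
struct sketch_gap {
  size_t gap_start;   /* memory index of the first byte of the gap */
  size_t gap_size;    /* number of bytes in the gap */
};

/* Text offset (gap ignored) to memory index (gap counted). */
size_t sketch_text_to_mem(const struct sketch_gap *g, size_t text_off)
{
  return text_off < g->gap_start ? text_off : text_off + g->gap_size;
}

/* Memory index back to text offset.  Indices inside the gap hold no
   text and are rejected. */
size_t sketch_mem_to_text(const struct sketch_gap *g, size_t mem_idx)
{
  assert(mem_idx < g->gap_start || mem_idx >= g->gap_start + g->gap_size);
  return mem_idx < g->gap_start ? mem_idx : mem_idx - g->gap_size;
}

/* Inserting n bytes at the gap only changes the gap's bookkeeping.
   Memory indices stored in markers are left untouched. */
void sketch_insert_at_gap(struct sketch_gap *g, size_t n)
{
  assert(n <= g->gap_size);
  g->gap_start += n;
  g->gap_size -= n;
}
```

In this model a marker that sits after the gap keeps the same memory index across an insertion at the gap, yet converting it back to a text offset gives a value that has moved forward by the number of bytes inserted. Only text that is physically moved (when the gap itself is relocated) forces stored indices to be adjusted.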
8066 | |
8067 One other important distinction is that markers are @dfn{temporary} | |
8068 while extents are @dfn{permanent}. This means that markers disappear as | |
8069 soon as there are no more pointers to them, and correspondingly, there | |
8070 is no way to determine what markers are in a buffer if you are just | |
8071 given the buffer. Extents remain in a buffer until they are detached | |
8072 (which could happen as a result of text being deleted) or the buffer is | |
8073 deleted, and primitives do exist to enumerate the extents in a buffer. | |
8074 | |
8075 @node The Buffer Object, , Markers and Extents, Buffers | |
8076 @section The Buffer Object | |
8077 @cindex buffer object, the | |
8078 @cindex object, the buffer | |
8079 | |
8080 Buffers contain fields not directly accessible by the Lisp programmer. | |
8081 We describe them here, naming them by the names used in the C code. | |
8082 Many are accessible indirectly in Lisp programs via Lisp primitives. | |
8083 | |
8084 @table @code | |
8085 @item name | |
8086 The buffer name is a string that names the buffer. It is guaranteed to | |
8087 be unique. @xref{Buffer Names,,, lispref, XEmacs Lisp Reference | |
8088 Manual}. | |
8089 | |
8090 @item save_modified | |
8091 This field contains the time when the buffer was last saved, as an | |
8092 integer. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference | |
8093 Manual}. | |
8094 | |
8095 @item modtime | |
8096 This field contains the modification time of the visited file. It is | |
8097 set when the file is written or read. Every time the buffer is written | |
8098 to the file, this field is compared to the modification time of the | |
8099 file. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference | |
8100 Manual}. | |
8101 | |
8102 @item auto_save_modified | |
8103 This field contains the time when the buffer was last auto-saved. | |
8104 | |
8105 @item last_window_start | |
8106 This field contains the @code{window-start} position in the buffer as of | |
8107 the last time the buffer was displayed in a window. | |
8108 | |
8109 @item undo_list | |
8110 This field points to the buffer's undo list. @xref{Undo,,, lispref, | |
8111 XEmacs Lisp Reference Manual}. | |
8112 | |
8113 @item syntax_table_v | |
8114 This field contains the syntax table for the buffer. @xref{Syntax | |
8115 Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
8116 | |
8117 @item downcase_table | |
8118 This field contains the conversion table for converting text to lower | |
8119 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
8120 | |
8121 @item upcase_table | |
8122 This field contains the conversion table for converting text to upper | |
8123 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
8124 | |
8125 @item case_canon_table | |
8126 This field contains the conversion table for canonicalizing text for | |
8127 case-folding search. @xref{Case Tables,,, lispref, XEmacs Lisp | |
8128 Reference Manual}. | |
8129 | |
8130 @item case_eqv_table | |
8131 This field contains the equivalence table for case-folding search. | |
8132 @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
8133 | |
8134 @item display_table | |
8135 This field contains the buffer's display table, or @code{nil} if it | |
8136 doesn't have one. @xref{Display Tables,,, lispref, XEmacs Lisp | |
8137 Reference Manual}. | |
8138 | |
8139 @item markers | |
8140 This field contains the chain of all markers that currently point into | |
8141 the buffer. Deletion of text in the buffer, and motion of the buffer's | |
8142 gap, must check each of these markers and perhaps update it. | |
8143 @xref{Markers,,, lispref, XEmacs Lisp Reference Manual}. | |
8144 | |
8145 @item backed_up | |
8146 This field is a flag that tells whether a backup file has been made for | |
8147 the visited file of this buffer. | |
8148 | |
8149 @item mark | |
8150 This field contains the mark for the buffer. The mark is a marker, | |
8151 hence it is also included on the list @code{markers}. @xref{The Mark,,, | |
8152 lispref, XEmacs Lisp Reference Manual}. | |
8153 | |
8154 @item mark_active | |
8155 This field is non-@code{nil} if the buffer's mark is active. | |
8156 | |
8157 @item local_var_alist | |
8158 This field contains the association list describing the variables local | |
8159 in this buffer, and their values, with the exception of local variables | |
8160 that have special slots in the buffer object. (Those slots are omitted | |
8161 from this table.) @xref{Buffer-Local Variables,,, lispref, XEmacs Lisp | |
8162 Reference Manual}. | |
8163 | |
8164 @item modeline_format | |
8165 This field contains a Lisp object which controls how to display the mode | |
8166 line for this buffer. @xref{Modeline Format,,, lispref, XEmacs Lisp | |
8167 Reference Manual}. | |
8168 | |
8169 @item base_buffer | |
8170 This field holds the buffer's base buffer (if it is an indirect buffer), | |
8171 or @code{nil}. | |
8172 @end table | |
8173 | |
8174 @node Text, Multilingual Support, Buffers, Top | |
8175 @chapter Text | |
8176 @cindex text | |
8177 | |
8178 @menu | |
8179 * The Text in a Buffer:: Representation of the text in a buffer. | |
8180 * Ibytes and Ichars:: Representation of individual characters. | |
8181 * Byte-Char Position Conversion:: | |
8182 * Searching and Matching:: Higher-level algorithms. | |
8183 @end menu | |
8184 | |
8185 @node The Text in a Buffer, Ibytes and Ichars, Text, Text | |
8186 @section The Text in a Buffer | |
8187 @cindex text in a buffer, the | |
8188 @cindex buffer, the text in a | |
8189 | |
8190 The text in a buffer consists of a sequence of zero or more | |
8191 characters. A @dfn{character} is an integer that logically represents | |
8192 a letter, number, space, or other unit of text. Most of the characters | |
8193 that you will typically encounter belong to the ASCII set of characters, | |
8194 but there are also characters for various sorts of accented letters, | |
8195 special symbols, Chinese and Japanese characters (e.g. Kanji, Katakana, | |
8196 etc.), Cyrillic and Greek letters, etc. The actual number of possible | |
8197 characters is quite large. | |
8198 | |
8199 For now, we can view a character as some non-negative integer that | |
8200 has some shape that defines how it typically appears (e.g. as an | |
8201 uppercase A). (The exact way in which a character appears depends on the | |
8202 font used to display the character.) The internal type of characters in | |
8203 the C code is an @code{Ichar}; this is just an @code{int}, but using a | |
8204 symbolic type makes the code clearer. | |
8205 | |
8206 Between every character in a buffer is a @dfn{buffer position} or | |
8207 @dfn{character position}. We can speak of the character before or after | |
8208 a particular buffer position, and when you insert a character at a | |
8209 particular position, all characters after that position end up at new | |
8210 positions. When we speak of the character @dfn{at} a position, we | |
8211 really mean the character after the position. (This schizophrenia | |
8212 between a buffer position being ``between'' two characters and ``on'' a | |
8213 character is rampant in Emacs.) | |
8214 | |
8215 Buffer positions are numbered starting at 1. This means that | |
8216 position 1 is before the first character, and position 0 is not | |
8217 valid. If there are N characters in a buffer, then buffer | |
8218 position N+1 is after the last one, and position N+2 is not valid. | |
8219 | |
8220 The internal makeup of the Ichar integer varies depending on whether | |
8221 we have compiled with MULE support. If not, the Ichar integer is an | |
8222 8-bit integer with possible values from 0 - 255. 0 - 127 are the | |
8223 standard ASCII characters, while 128 - 255 are the characters from the | |
8224 ISO-8859-1 character set. If we have compiled with MULE support, an | |
8225 Ichar is a 19-bit integer, with the various bits having meanings | |
8226 according to a complex scheme that will be detailed later. The | |
8227 characters numbered 0 - 255 still have the same meanings as for the | |
8228 non-MULE case, though. | |
8229 | |
8230 Internally, the text in a buffer is represented in a fairly simple | |
8231 fashion: as a contiguous array of bytes, with a @dfn{gap} of some size | |
8232 in the middle. Although the gap is of some substantial size in bytes, | |
8233 there is no text contained within it: From the perspective of the text | |
8234 in the buffer, it does not exist. The gap logically sits at some buffer | |
8235 position, between two characters (or possibly at the beginning or end of | |
8236 the buffer). Insertion of text in a buffer at a particular position is | |
8237 always accomplished by first moving the gap to that position | |
8238 (i.e. through some block moving of text), then writing the text into the | |
8239 beginning of the gap, thereby shrinking the gap. If the gap shrinks | |
8240 down to nothing, a new gap is created. (What actually happens is that a | |
8241 new gap is ``created'' at the end of the buffer's text, which requires | |
8242 nothing more than changing a couple of indices; then the gap is | |
8243 ``moved'' to the position where the insertion needs to take place by | |
8244 moving up in memory all the text after that position.) Similarly, | |
8245 deletion occurs by moving the gap to the place where the text is to be | |
8246 deleted, and then simply expanding the gap to include the deleted text. | |
8247 (@dfn{Expanding} and @dfn{shrinking} the gap as just described means | |
8248 just that the internal indices that keep track of where the gap is | |
8249 located are changed.) | |
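
The gap mechanics described above can be sketched in a few lines of C. This is only a toy under stated assumptions, not the actual XEmacs code: the @code{struct gapbuf} type and the @code{gb_*} functions are hypothetical names, offsets are 0-based for brevity (real buffer positions start at 1), and the growth amount of 64 bytes is arbitrary.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy gap buffer; names are illustrative, not XEmacs's. */
struct gapbuf {
  unsigned char *mem;   /* text before the gap, the gap, text after it */
  size_t gap_start;     /* 0-based offset of the first gap byte */
  size_t gap_size;      /* number of unused bytes in the gap */
  size_t text_len;      /* number of text bytes, gap excluded */
};

void gb_init (struct gapbuf *g, size_t gap)
{
  assert (gap > 0);
  g->mem = malloc (gap);
  assert (g->mem != NULL);
  g->gap_start = 0;
  g->gap_size = gap;
  g->text_len = 0;
}

/* Move the gap so that it starts at text offset POS. */
void gb_move_gap (struct gapbuf *g, size_t pos)
{
  assert (pos <= g->text_len);
  if (pos < g->gap_start)
    /* Text in [pos, gap_start) slides up to the far side of the gap. */
    memmove (g->mem + pos + g->gap_size, g->mem + pos, g->gap_start - pos);
  else
    /* Text in [gap_start, pos), stored after the gap, slides down. */
    memmove (g->mem + g->gap_start, g->mem + g->gap_start + g->gap_size,
             pos - g->gap_start);
  g->gap_start = pos;
}

void gb_insert (struct gapbuf *g, size_t pos, const char *s, size_t n)
{
  if (n > g->gap_size)
    {
      /* Put the gap at the end of the text, then enlarge it there. */
      size_t extra = n + 64;
      gb_move_gap (g, g->text_len);
      g->mem = realloc (g->mem, g->text_len + g->gap_size + extra);
      assert (g->mem != NULL);
      g->gap_size += extra;
    }
  gb_move_gap (g, pos);
  memcpy (g->mem + g->gap_start, s, n);  /* write into the start of the gap */
  g->gap_start += n;                     /* ... thereby shrinking it */
  g->gap_size -= n;
  g->text_len += n;
}

void gb_delete (struct gapbuf *g, size_t pos, size_t n)
{
  assert (pos + n <= g->text_len);
  gb_move_gap (g, pos);
  g->gap_size += n;   /* the deleted bytes simply become part of the gap */
  g->text_len -= n;
}

unsigned char gb_byte_at (const struct gapbuf *g, size_t pos)
{
  assert (pos < g->text_len);
  return g->mem[pos < g->gap_start ? pos : pos + g->gap_size];
}
```

Note that @code{gb_delete} copies nothing beyond what moving the gap requires; only the bookkeeping fields change, which is the point of the parenthetical remark about expanding and shrinking.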
8250 | |
8251 Note that the total amount of memory allocated for a buffer's text never | |
8252 decreases while the buffer is live. Therefore, if you load up a | |
8253 20-megabyte file and then delete all but one character, there will be a | |
8254 20-megabyte gap, which won't get any smaller (except by inserting | |
8255 characters back again). Once the buffer is killed, the memory allocated | |
8256 for the buffer text will be freed, but it will still be sitting on the | |
8257 heap, taking up virtual memory, and will not be released back to the | |
8258 operating system. (However, if you have compiled XEmacs with rel-alloc, | |
8259 the situation is different. In this case, the space @emph{will} be | |
8260 released back to the operating system. However, this tends to result in a | |
8261 noticeable speed penalty.) | |
8262 | |
8263 Astute readers may notice that the text in a buffer is represented as | |
8264 an array of @emph{bytes}, while (at least in the MULE case) an Ichar is | |
8265 a 19-bit integer, which clearly cannot fit in a byte. This means (of | |
8266 course) that the text in a buffer uses a different representation from | |
8267 an Ichar: specifically, the 19-bit Ichar becomes a series of one to | |
8268 four bytes. The conversion between these two representations is complex | |
8269 and will be described later. | |
8270 | |
8271 In the non-MULE case, everything is very simple: An Ichar | |
8272 is an 8-bit value, which fits neatly into one byte. | |
8273 | |
8274 If we are given a buffer position and want to retrieve the | |
8275 character at that position, we need to follow these steps: | |
8276 | |
8277 @enumerate | |
8278 @item | |
8279 Pretend there's no gap, and convert the buffer position into a @dfn{byte | |
8280 index} that indexes to the appropriate byte in the buffer's stream of | |
8281 textual bytes. By convention, byte indices begin at 1, just like buffer | |
8282 positions. In the non-MULE case, byte indices and buffer positions are | |
8283 identical, since one character equals one byte. | |
8284 @item | |
8285 Convert the byte index into a @dfn{memory index}, which takes the gap | |
8286 into account. The memory index is a direct index into the block of | |
8287 memory that stores the text of a buffer. This basically just involves | |
8288 checking to see if the byte index is past the gap, and if so, adding the | |
8289 size of the gap to it. By convention, memory indices begin at 1, just | |
8290 like buffer positions and byte indices, and when referring to the | |
8291 position that is @dfn{at} the gap, we always use the memory position at | |
8292 the @emph{beginning}, not at the end, of the gap. | |
8293 @item | |
8294 Fetch the appropriate bytes at the determined memory position. | |
8295 @item | |
8296 Convert these bytes into an Ichar. | |
8297 @end enumerate | |
8298 | |
8299 In the non-MULE case, (3) and (4) boil down to a simple one-byte | |
8300 memory access. | |
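
For the non-MULE case, steps (2) and (3) might look roughly like the following. This is a sketch under assumptions: @code{struct text_sketch} and both function names are hypothetical stand-ins for the real buffer fields and macros, and the choice of @code{>} versus @code{>=} reflects the convention stated above (a position at the gap maps to the beginning of the gap, while the character following that position is stored past the gap).

```c
#include <assert.h>

/* Hypothetical stand-ins for the real buffer fields. */
struct text_sketch {
  const unsigned char *mem;  /* block holding text, gap, text */
  int gap_pos;               /* 1-based byte index where the gap sits */
  int gap_size;              /* size of the gap in bytes */
  int text_bytes;            /* bytes of text, gap excluded */
};

/* Byte index -> memory index, both 1-based.  A position exactly at
   the gap maps to the beginning of the gap, not its end. */
int byte_to_mem (const struct text_sketch *t, int bytepos)
{
  assert (bytepos >= 1 && bytepos <= t->text_bytes + 1);
  return bytepos > t->gap_pos ? bytepos + t->gap_size : bytepos;
}

/* Fetch the byte that follows position BYTEPOS.  Text at or after
   the gap position is stored on the far side of the gap. */
unsigned char fetch_byte (const struct text_sketch *t, int bytepos)
{
  int mem;
  assert (bytepos >= 1 && bytepos <= t->text_bytes);
  mem = bytepos >= t->gap_pos ? bytepos + t->gap_size : bytepos;
  return t->mem[mem - 1];   /* memory indices start at 1 */
}
```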
8301 | |
8302 Note that we have defined three types of positions in a buffer: | |
8303 | |
8304 @enumerate | |
8305 @item | |
8306 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos} | |
8307 @item | |
8308 @dfn{byte indices}, typedef @code{Bytebpos} | |
8309 @item | |
8310 @dfn{memory indices}, typedef @code{Membpos} | |
8311 @end enumerate | |
8312 | |
8313 All three typedefs are just @code{int}s, but defining them this way makes | |
8314 things a lot clearer. | |
8315 | |
8316 Most code works with buffer positions. In particular, all Lisp code | |
8317 that refers to text in a buffer uses buffer positions. Lisp code does | |
8318 not know that byte indices or memory indices exist. | |
8319 | |
8320 Finally, we have a typedef for the bytes in a buffer. This is a | |
8321 @code{Ibyte}, which is an unsigned char. Referring to them as | |
8322 Ibytes underscores the fact that we are working with a string of bytes | |
8323 in the internal Emacs buffer representation rather than in one of a | |
8324 number of possible alternative representations (e.g. EUC-encoded text, | |
8325 etc.). | |
8326 | |
8327 @node Ibytes and Ichars, Byte-Char Position Conversion, The Text in a Buffer, Text | |
8328 @section Ibytes and Ichars | |
8329 @cindex Ibytes and Ichars | |
8330 @cindex Ichars, Ibytes and | |
8331 | |
8332 Not yet documented. | |
8333 | |
8334 @node Byte-Char Position Conversion, Searching and Matching, Ibytes and Ichars, Text | |
8335 @section Byte-Char Position Conversion | |
8336 @cindex byte-char position conversion | |
8337 @cindex position conversion, byte-char | |
8338 @cindex conversion, byte-char position | |
8339 | |
8340 Oct 2004: | |
8341 | |
8342 This is what I wrote when describing the previous algorithm: | |
8343 | |
8344 @quotation | |
8345 The basic algorithm we use is to keep track of a known region of | |
8346 characters in each buffer, all of which are of the same width. We keep | |
8347 track of the boundaries of the region in both Charbpos and Bytebpos | |
8348 coordinates and also keep track of the char width, which is 1 - 4 bytes. | |
8349 If the position we're translating is not in the known region, then we | |
8350 invoke a function to update the known region to surround the position in | |
8351 question. This assumes locality of reference, which is usually the | |
8352 case. | |
8353 | |
8354 Note that the function to update the known region can be simple or | |
8355 complicated depending on how much information we cache. In addition to | |
8356 the known region, we always cache the correct conversions for point, | |
8357 BEGV, and ZV, and in addition to this we cache 16 positions where the | |
8358 conversion is known. We only look in the cache or update it when we | |
8359 need to move the known region more than a certain amount (currently 50 | |
8360 chars), and then we throw away a "random" value and replace it with the | |
8361 newly calculated value. | |
8362 | |
8363 Finally, we maintain an extra flag that tracks whether the buffer is | |
8364 entirely ASCII, to speed up the conversions even more. This flag is | |
8365 actually of dubious value because in an entirely-ASCII buffer the known | |
8366 region will always span the entire buffer (in fact, we update the flag | |
8367 based on this fact), and so all we're saving is a few machine cycles. | |
8368 | |
8369 A potentially smarter method than what we do with known regions and | |
8370 cached positions would be to keep some sort of pseudo-extent layer over | |
8371 the buffer; maybe keep track of the charbpos/bytebpos correspondence at | |
8372 the beginning of each line, which would allow us to do a binary search | |
8373 over the pseudo-extents to narrow things down to the correct line, at | |
8374 which point you could use a linear movement method. This would also | |
8375 mesh well with efficiently implementing a line-numbering scheme. | |
8376 However, you have to weigh the amount of time spent updating the cache | |
8377 vs. the savings that result from it. In reality, we modify the buffer | |
8378 far less often than we access it, so a cache of this sort that provides | |
8379 guaranteed LOG (N) performance (or perhaps N * LOG (N), if we set a | |
8380 maximum on the cache size) would indeed be a win, particularly in very | |
8381 large buffers. If we ever implement this, we should probably set a | |
8382 reasonably high minimum below which we use the old method, because the | |
8383 time spent updating the fancy cache would likely become dominant when | |
8384 making buffer modifications in smaller buffers. | |
8385 | |
8386 Note also that we have to multiply or divide by the char width in order | |
8387 to convert the positions. We do some tricks to avoid ever actually | |
8388 having to do a multiply or divide, because that is typically an | |
8389 expensive operation (esp. divide). Multiplying or dividing by 1, 2, or | |
8390 4 can be implemented simply as a shift left or shift right, and we keep | |
8391 track of a shifter value (0, 1, or 2) indicating how much to shift. | |
8392 Multiplying by 3 can be implemented by doubling and then adding the | |
8393 original value. Dividing by 3, alas, cannot be implemented in any | |
8394 simple shift/subtract method, as far as I know; so we just do a table | |
8395 lookup. For simplicity, we use a table of size 128K, which indexes the | |
8396 "divide-by-3" values for the first 64K non-negative numbers. (Note that | |
8397 we can increase the size up to 384K, i.e. indexing the first 192K | |
8398 non-negative numbers, while still using shorts in the array.) This also | |
8399 means that the size of the known region can be at most 64K for | |
8400 width-three characters. | |
8401 @end quotation | |
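
The width arithmetic from the quoted description can be illustrated as follows. This is a reconstruction from the prose only, not the old code itself; the function and table names are made up, and the shift amount is derived from the width on the fly rather than stored as a separate shifter value.

```c
#include <assert.h>

/* 64K entries of 2 bytes each: the 128K table mentioned above. */
enum { DIV3_TABLE_SIZE = 65536 };
static unsigned short div3_table[DIV3_TABLE_SIZE];

void init_div3_table (void)
{
  int i;
  for (i = 0; i < DIV3_TABLE_SIZE; i++)
    div3_table[i] = (unsigned short) (i / 3);  /* paid once, at startup */
}

/* Multiply a char count by the char width without a multiply. */
int chars_to_bytes (int nchars, int width)
{
  assert (width >= 1 && width <= 4);
  if (width == 3)
    return (nchars << 1) + nchars;   /* double, then add the original */
  return nchars << (width >> 1);     /* widths 1, 2, 4 -> shifts 0, 1, 2 */
}

/* Divide a byte count by the char width without a divide. */
int bytes_to_chars (int nbytes, int width)
{
  assert (width >= 1 && width <= 4);
  if (width == 3)
    {
      /* The table limits the known region to 64K bytes. */
      assert (nbytes >= 0 && nbytes < DIV3_TABLE_SIZE);
      return div3_table[nbytes];
    }
  return nbytes >> (width >> 1);
}
```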
8402 | |
8403 Unfortunately, it turned out that the implementation had serious problems | |
8404 which had never been corrected. In particular, the known region had a | |
8405 strong tendency to become zero-length and stay that way. | |
8406 | |
8407 So I decided to port the algorithm from FSF 21.3, in markers.c. | |
8408 | |
8409 This algorithm is fairly simple. Instead of using markers I kept the cache | |
8410 array of known positions from the previous implementation. | |
8411 | |
8412 Basically, we keep a number of positions cached: | |
7199 | 8413 |
7200 @itemize @bullet | 8414 @itemize @bullet |
7201 @item lisp objects | 8415 @item |
7202 @item other memory blocks (C structures, arrays. etc) | 8416 the actual end of the buffer |
8417 @item | |
8418 the beginning and end of the accessible region | |
8419 @item | |
8420 the value of point | |
8421 @item | |
8422 the position of the gap | |
8423 @item | |
8424 the last value we computed | |
8425 @item | |
8426 a set of positions that are "far away" from previously computed positions | |
8427 (5000 chars currently; #### perhaps should be smaller) | |
7203 @end itemize | 8428 @end itemize |
7204 | 8429 |
7205 We end up with one @code{pdump_block_list_elt} per object group (arrays | 8430 For each position, we @code{CONSIDER()} it. This means: |
7206 of C structs are kept together) which includes a pointer to the first | 8431 |
7207 object of the group, the per-object size and the count of objects in the | 8432 @itemize @bullet |
7208 group, along with some other information which is initialized later. | 8433 @item |
7209 | 8434 If the position is what we're looking for, return it directly. |
7210 These entries are linked together in @code{pdump_block_list} structures | 8435 @item |
7211 and can be enumerated thru either: | 8436 Starting with the beginning and end of the buffer, we successively |
8437 compute the smallest enclosing range of known positions. If at any | |
8438 point we discover that this range has the same byte and char length | |
8439 (i.e. is entirely single-byte), then our computation is trivial. | |
8440 @item | |
8441 If at any point we get a small enough range (50 chars currently), | |
8442 stop considering further positions. | |
8443 @end itemize | |
8444 | |
8445 Otherwise, once we have an enclosing range, see which side is closer, and | |
8446 iterate until we find the desired value. As an optimization, I replaced | |
8447 the simple loop in FSF with the use of @code{bytecount_to_charcount()}, | |
8448 @code{charcount_to_bytecount()}, @code{bytecount_to_charcount_down()}, or | |
8449 @code{charcount_to_bytecount_down()}. (The latter two I added for this purpose.) | |
8450 These scan 4 or 8 bytes at a time through purely single-byte characters. | |
8451 | |
8452 If the amount we had to scan was more than our "far away" distance (5000 | |
8453 characters, see above), then cache the new position. | |
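
The overall shape of the conversion can be sketched as below. Several things here are assumptions made purely for illustration: the text is taken to be in a UTF-8-style encoding (a byte starts a character unless it has the form 10xxxxxx), which is not the internal Mule encoding; positions are 0-based offsets; @code{struct known_pos} and @code{char_to_byte} are hypothetical names; and the scan goes one byte at a time rather than 4 or 8. Caching the result after a long scan is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: a position known in both coordinates. */
struct known_pos { size_t ch; size_t byte; };

/* Assumption: UTF-8-style text, not the real internal encoding. */
static int starts_char (unsigned char c) { return (c & 0xC0) != 0x80; }

size_t char_to_byte (const unsigned char *text, size_t nbytes, size_t nchars,
                     const struct known_pos *cache, size_t ncache,
                     size_t target)
{
  struct known_pos lo = { 0, 0 };   /* beginning of the buffer */
  struct known_pos hi;              /* end of the buffer */
  size_t i, b, c;

  hi.ch = nchars;
  hi.byte = nbytes;
  assert (target <= nchars);

  /* "Consider" each cached position, tightening the enclosing range. */
  for (i = 0; i < ncache; i++)
    {
      if (cache[i].ch == target)
        return cache[i].byte;
      if (cache[i].ch < target && cache[i].ch > lo.ch)
        lo = cache[i];
      else if (cache[i].ch > target && cache[i].ch < hi.ch)
        hi = cache[i];
      if (hi.ch - lo.ch <= 50)
        break;                      /* small enough; stop considering */
    }

  /* Same length in bytes and chars: the range is all single-byte. */
  if (hi.byte - lo.byte == hi.ch - lo.ch)
    return lo.byte + (target - lo.ch);

  if (target - lo.ch <= hi.ch - target)
    {
      /* The lower bound is closer: scan forward. */
      b = lo.byte;
      for (c = lo.ch; c < target; c++)
        do { b++; } while (b < nbytes && !starts_char (text[b]));
    }
  else
    {
      /* The upper bound is closer: scan backward. */
      b = hi.byte;
      for (c = hi.ch; c > target; c--)
        do { b--; } while (!starts_char (text[b]));
    }
  return b;
}
```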
8454 | |
8455 #### Things to do: | |
8456 | |
8457 @itemize @bullet | |
8458 @item | |
8459 Look at the most recent GNU Emacs to see whether anything has changed. | |
8460 @item | |
8461 Think about whether it makes sense to try to implement some sort of | |
8462 known region or list of "known regions", like we had before. This would | |
8463 be a region of entirely single-byte characters that we can check very | |
8464 quickly. (Previously I used a range of same-width characters of any | |
8465 size; but this adds extra complexity and slows down the scanning, and is | |
8466 probably not worth it.) As part of the scanning process in | |
8467 @code{bytecount_to_charcount()} et al, we skip over chunks of entirely | |
8468 single-byte chars, so it should be easy to remember the last one. | |
8469 Presumably what we should do is keep track of the largest known surrounding | |
8470 entirely-single-byte region for each of the cache positions as well as | |
8471 perhaps the last-cached position. We want to be careful not to get bitten | |
8472 by the previous problem of having the known region getting reset too | |
8473 often. If we implement this, we might well want to continue scanning | |
8474 some distance past the desired position (maybe 300-1000 bytes) if we are | |
8475 in a single-byte range so that we won't end up expanding the known range | |
8476 one position at a time and entering the function each time. | |
8477 @item | |
8478 Think about whether it makes sense to keep the position cache sorted. | |
8479 This would allow it to be larger and finer-grained in its positions. | |
8480 Note that with FSF's use of markers, the markers were sorted, but little | |
8481 use was made of this. With an array, we can do binary searching | |
8482 to quickly find the smallest range. We would probably want to make use of | |
8483 the gap-array code in extents.c. | |
8484 @end itemize | |
8485 | |
8486 Note that FSF's algorithm checked @strong{ALL} markers, not just the ones cached | |
8487 by this algorithm. This includes markers created by the user as well as | |
8488 both ends of any overlays. We could do similarly, and our extents could | |
8489 keep both byte and character positions rather than just the former. (But | |
8490 this would probably be overkill. We should just use our cache instead. | |
8491 Any place an extent was set was surely already visited by the char<-->byte | |
8492 conversion routines.) | |
8493 | |
8494 @node Searching and Matching, , Byte-Char Position Conversion, Text | |
8495 @section Searching and Matching | |
8496 @cindex searching | |
8497 @cindex matching | |
8498 | |
8499 Very incomplete, limited to a brief introduction. | |
8500 | |
8501 People find the searching and matching code difficult to understand. | |
8502 And indeed, the details are hard. However, the basic structures are not | |
8503 so complex. First, there's a hard question with a simple answer. What | |
8504 about Mule? The answer here is that it turns out that Mule characters | |
8505 can be matched byte by byte, so neither the search code nor the regular | |
8506 expression code need take much notice of it at all! Of course, we add | |
8507 some special features (such as regular expressions that match only | |
8508 certain charsets), but these do not require new concepts. The main | |
8509 exception is that wild-card matches in Mule have to be careful to | |
8510 swallow whole characters. This is handled using the same basic macros | |
8511 that are used for buffer and string movements. | |
8512 | |
8513 This will also be true if a UTF-8 representation is used for the | |
8514 internal encoding. | |
8515 | |
8516 The complex algorithms for searching are for simple string searches. In | |
8517 particular, the algorithm used for fast string searching is Boyer-Moore. | |
8518 This algorithm is based on the idea that if you have a mismatch at a | |
8519 given position, you can precompute where to restart the search. This | |
8520 typically means that you can often make many fewer than N character | |
8521 comparisons, where N is the position at which the match is found, or the | |
8522 size of the text if it contains no match. That's fast! But it's not | |
8523 easy. You must ``compile'' the search string into a jump table. See | |
8524 the source, @file{search.c}, for more information. | |
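
To give a feel for the jump table, here is a sketch of the Horspool simplification of Boyer-Moore. It is deliberately not what @file{search.c} does: it uses only a single skip table keyed on the last character of the current window, compares with @code{memcmp} instead of scanning right to left, and ignores case folding and Mule entirely. The function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first occurrence of PAT in TEXT, or -1. */
long horspool_search (const char *text, const char *pat)
{
  size_t n, m, i, pos;
  size_t skip[UCHAR_MAX + 1];

  assert (text != NULL && pat != NULL);
  n = strlen (text);
  m = strlen (pat);
  if (m == 0)
    return 0;
  if (m > n)
    return -1;

  /* "Compile" the pattern: for each possible character, how far the
     window may slide when that character is under its last slot. */
  for (i = 0; i <= UCHAR_MAX; i++)
    skip[i] = m;                     /* not in the pattern: jump past it */
  for (i = 0; i + 1 < m; i++)
    skip[(unsigned char) pat[i]] = m - 1 - i;

  for (pos = 0; pos + m <= n;
       pos += skip[(unsigned char) text[pos + m - 1]])
    if (memcmp (text + pos, pat, m) == 0)
      return (long) pos;
  return -1;
}
```

When the character under the end of the window does not occur in the pattern at all, the window slides by the full pattern length, which is where the "many fewer than N comparisons" behavior comes from.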
8525 | |
8526 Emacs changes the basic algorithms somewhat in order to handle | |
8527 case-insensitive searches without a full-blown regular expression. | |
8528 | |
8529 Regular expressions, on the other hand, have a trivial search | |
8530 implementation: try a match at each position. (Under POSIX rules, it's | |
8531 a bit more complex, because POSIX requires that you find the | |
8532 @emph{longest} match in the text. This means you keep a record of the | |
8533 best match so far, and find all the matches.) | |
8534 | |
8535 The matching code for regular expressions is quite complex. First, the | |
8536 regular expression itself is compiled. There are two basic approaches | |
8537 that could be taken. The first is to compile the expression into tables | |
8538 to drive a generic finite automaton emulator. This is the approach | |
8539 given in many textbooks (Sedgewick's @emph{Algorithms} and Aho, Sethi, | |
8540 and Ullman's @emph{Compilers: Principles, Techniques, and Tools}, aka | |
8541 ``The Dragon Book'') as well as being used by the @file{lex} family of | |
8542 lexical analysis engines. | |
8543 | |
8544 Emacs uses a somewhat different technique. The expression is compiled | |
8545 into a form of bytecode, which is interpreted by a special interpreter. | |
8546 The interpreter itself basically amounts to an inline implementation of | |
8547 the finite automaton emulator. The advantage of this technique is that | |
8548 it's easier to add special features, such as control of case-sensitivity | |
8549 via a global variable. | |
8550 | |
8551 The compiler is not treated here. See the source, @file{regex.c}. The | |
8552 interpreter, although it is divided into several functions, and looks | |
8553 fearsomely complex, is actually quite simple in concept. After all, | |
8554 basically what you're doing there is a @code{strcmp} on steroids, right? | |
8555 | |
8556 @example | |
8557 int | |
8558 strcmp (char *p, /* pattern pointer */ | |
8559 char *b) /* buffer pointer */ | |
8560 @{ | |
8561 while (*p && *p == *b) | |
8562 p++, b++; | |
8563 return (unsigned char) *p - (unsigned char) *b; | |
8564 @} | |
8565 @end example | |
8566 | |
8567 Really, it's no harder than that. (A bit of a white lie, OK?) | |
8568 | |
8569 How does the regexp code generalize this? | |
7212 | 8570 |
7213 @enumerate | 8571 @enumerate |
7214 @item | 8572 @item |
7215 the @code{pdump_object_table}, an array of @code{pdump_block_list}, one | 8573 Depending on the pattern, @code{*b} may have a general relationship to |
7216 per lrecord type, indexed by type number. | 8574 @code{*p}. @emph{I.e.}, direct comparison against @code{*p} is |
7217 | 8575 generalized to include checks for set membership, and context dependent |
7218 @item | 8576 properties. This depends on @code{&*b}. Of course that's meaningless |
7219 the @code{pdump_opaque_data_list}, used for the opaque data which does | 8577 in C, so we use @code{b} directly, instead. |
7220 not include pointers, and hence does not need descriptions. | 8578 |
7221 | 8579 @item |
7222 @item | 8580 Although to ensure the algorithm terminates, @code{b} must advance step |
7223 the @code{pdump_desc_table}, which is a vector of | 8581 by step, @code{p} can branch and jump. |
7224 @code{memory_description}/@code{pdump_block_list} pairs, used for | 8582 |
7225 non-opaque C memory blocks. | 8583 @item |
8584 The information returned is much greater, including information about | |
8585 subexpressions. | |
7226 @end enumerate | 8586 @end enumerate |
7227 | 8587 |
7228 This uses a marking strategy similar to the garbage collector. Some | 8588 We'll ignore (3). (2) is mostly interesting when compiling the regular |
7229 differences though: | 8589 expression. Now we have |
8590 | |
8591 @example | |
8592 @group | |
8593 enum operator_t @{ | |
8594 accept = 0, | |
8595 exact, | |
8596 any, | |
8597 range, | |
8598 group, /* actually, these are probably */ | |
8599 repeat, /* turned into conditional code */ | |
8600 /* etc */ | |
8601 @}; | |
8602 @end group | |
8603 | |
8604 @group | |
8605 enum status_t @{ | |
8606 working = 0, | |
8607 matched, | |
8608 mismatch, | |
8609 end_of_buffer, | |
8610 error | |
8611 @}; | |
8612 @end group | |
8613 | |
8614 @group | |
8615 struct pattern @{ | |
8616 enum operator_t operator; | |
8617 char char_value; | |
8618 boolean range_table[256]; | |
8619 /* etc, etc */ | |
8620 @}; | |
8621 @end group | |
8622 | |
8623 @group | |
8624 char *p, /* pattern pointer */ | |
8625 *b; /* buffer pointer */ | |
8626 | |
8627 enum status_t | |
8628 match (struct pattern *p, char *b) | |
8629 @{ | |
8630 enum status_t done = working; | |
8631 | |
8632 while (!(done = match_1_operator (p, b))) | |
8633 @{ | |
8634 struct pattern *p1 = p; | |
8635 p = next_p (p, b); | |
8636 b = next_b (p1, b); | |
8637 @} | |
8638 return done; | |
8639 @} | |
8640 @end group | |
8641 @end example | |
8642 | |
8643 This format exposes the underlying finite automaton. | |
8644 | |
8645 The helper functions called from @code{match} all have the following | |
8646 structure, except that the @samp{next_*} functions decide where to jump | |
8647 (for @samp{p}) and whether or not to increment (for @samp{b}), rather | |
8648 than checking for satisfaction of a matching condition. | |
8649 | |
8650 @example | |
8651 enum status_t | |
8652 match_1_operator (struct pattern *p, char *b) | |
8653 @{ | |
8654 if (! *b) return end_of_buffer; | |
8655 switch (p->operator) | |
8656 @{ | |
8657 case accept: | |
8658 return matched; | |
8659 case exact: | |
8660 if (*b != p->char_value) return mismatch; else break; | |
8661 case any: | |
8662 break; | |
8663 case range: | |
8664 /* range_table is computed in the regexp_compile function */ | |
8665 if (! p->range_table[(unsigned char) *b]) return mismatch; else break; | |
8666 /* etc, etc */ | |
8667 @} | |
8668 return working; | |
8669 @} | |
8670 @end example | |
8671 | |
8672 Grouping, repetition, and alternation are handled by compiling the | |
8673 subexpression and calling @code{match (p->subpattern, b)} recursively. | |
8674 | |
8675 In terms of reading the actual code, there are five optimizations | |
8676 (obfuscations, if you like) that have been done. | |
7230 | 8677 |
7231 @enumerate | 8678 @enumerate |
7232 @item | 8679 @item |
7233 We do not use the mark bit (which does not exist for generic memory blocks | 8680 An explicit "failure stack" has been substituted for recursion. |
7234 anyway); we use a big hash table instead. | 8681 |
7235 | 8682 @item |
7236 @item | 8683 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions |
7237 We do not use the mark function of lrecords but instead rely on the | 8684 are actually inlined into the @code{match} function for efficiency. |
7238 external descriptions. This happens essentially because we need to | 8685 Then the pointer movement is interspersed with the matching operations. |
7239 follow pointers to generic memory blocks and opaque data in addition to | 8686 |
7240 Lisp_Object members. | 8687 @item |
8688 If the operator uses buffer context, the buffer pointer movement is | |
8689 sometimes implicit in the operations retrieving the context. | |
8690 | |
8691 @item | |
8692 Some cases are combined into short preparation for individual cases, and | |
8693 a "fall-through" into combined code for several cases. | |
8694 | |
8695 @item | |
8696 The @code{pattern} type is not an explicit @samp{struct}. Instead, the | |
8697 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the | |
8698 compiled bytecode. This leads to bizarre code in the interpreter like | |
8699 | |
8700 @example | |
8701 case range: | |
8702 p += *(p + 1); break; | |
8703 @end example | |
8704 | |
8705 in @code{next_p}, because the compiled pattern is laid out | |
8706 | |
8707 @example | |
8708 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ... | |
8709 @end example | |
7241 @end enumerate | 8710 @end enumerate |
7242 | 8711 |
7243 This is done by @code{pdump_register_object()}, which handles | 8712 But if you keep your eye on the "switch in a loop" structure, you |
7244 Lisp_Object variables, and @code{pdump_register_block()} which handles | 8713 should be able to understand the parts you need. |
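
For readers who want something they can actually compile, here is a toy version of the "switch in a loop" structure, together with the trivial search loop that tries a match at each position. It is a sketch only: the @code{toy_*} names and the @code{struct pat_op} layout are invented for this example, the pattern is a hand-built array rather than compiled bytecode, and there is no grouping, repetition, or failure stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_op { OP_ACCEPT, OP_EXACT, OP_ANY, OP_RANGE };

/* Hypothetical pattern element; the real thing is inlined bytecode. */
struct pat_op {
  enum toy_op op;
  char ch;        /* OP_EXACT: the character to match */
  char lo, hi;    /* OP_RANGE: inclusive bounds */
};

/* Anchored match: does the pattern P match starting at B? */
bool toy_match (const struct pat_op *p, const char *b)
{
  assert (p != NULL && b != NULL);
  for (;;)
    {
      switch (p->op)
        {
        case OP_ACCEPT:
          return true;
        case OP_EXACT:
          if (*b == '\0' || *b != p->ch) return false;
          break;
        case OP_ANY:
          if (*b == '\0') return false;
          break;
        case OP_RANGE:
          if (*b == '\0' || *b < p->lo || *b > p->hi) return false;
          break;
        }
      /* In this toy, next_p and next_b are both plain increments. */
      p++;
      b++;
    }
}

/* Search: try a match at each position, leftmost first. */
long toy_search (const struct pat_op *p, const char *text)
{
  size_t i;
  for (i = 0; ; i++)
    {
      if (toy_match (p, text + i))
        return (long) i;
      if (text[i] == '\0')
        return -1;
    }
}
```

Every operator that consumes a character rejects the terminating NUL, so @code{b} can never run off the end of the text.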
7245 generic memory blocks (C structures, arrays, etc.), which both delegate | 8714 |
7246 the description management to @code{pdump_register_sub()}. | 8715 @node Multilingual Support, Consoles; Devices; Frames; Windows, Text, Top |
7247 | 8716 @chapter Multilingual Support |
7248 The hash table doubles as a map object to pdump_block_list_elmt (i.e. | 8717 @cindex Mule character sets and encodings |
7249 allows us to look up a pdump_block_list_elmt with the object it points | 8718 @cindex character sets and encodings, Mule |
7250 to). Entries are added with @code{pdump_add_block()} and looked up with | 8719 @cindex encodings, Mule character sets and |
7251 @code{pdump_get_block()}. There is no need for entry removal. The hash | 8720 |
7252 value is computed quite simply from the object pointer by | 8721 @emph{NOTE}: There is a great deal of overlapping and redundant |
7253 @code{pdump_make_hash()}. | 8722 information in this chapter. Ben wrote introductions to Mule issues a |
7254 | 8723 number of times, each time not realizing that he had already written |
7255 The roots for the marking are: | 8724 another introduction previously. Hopefully, in time these will all be |
8725 integrated. | |
8726 | |
8727 @emph{NOTE}: The information at the top of the source file | |
8728 @file{text.c} is more complete than the following, and there is also a | |
8729 list of all other places to look for text/I18N-related info. Also look in | |
8730 @file{text.h} for info about the DFC and Eistring API's. | |
8731 | |
8732 Recall that there are two primary ways that text is represented in | |
8733 XEmacs. The @dfn{buffer} representation sees the text as a series of | |
8734 bytes (Ibytes), with a variable number of bytes used per character. | |
8735 The @dfn{character} representation sees the text as a series of integers | |
8736 (Ichars), one per character. The character representation is a cleaner | |
8737 representation from a theoretical standpoint, and is thus used in many | |
8738 cases when lots of manipulations on a string need to be done. However, | |
8739 the buffer representation is the standard representation used in both | |
8740 Lisp strings and buffers, and because of this, it is the ``default'' | |
8741 representation that text comes in. The reason for using this | |
8742 representation is that it's compact and is compatible with ASCII. | |
8743 | |
8744 @menu | |
8745 * Introduction to Multilingual Issues #1:: | |
8746 * Introduction to Multilingual Issues #2:: | |
8747 * Introduction to Multilingual Issues #3:: | |
8748 * Introduction to Multilingual Issues #4:: | |
8749 * Character Sets:: | |
8750 * Encodings:: | |
8751 * Internal Mule Encodings:: | |
8752 * Byte/Character Types; Buffer Positions; Other Typedefs:: | |
8753 * Internal Text API's:: | |
8754 * Coding for Mule:: | |
8755 * CCL:: | |
8756 * Microsoft Windows-Related Multilingual Issues:: | |
8757 * Modules for Internationalization:: | |
8758 @end menu | |
8759 | |
8760 @node Introduction to Multilingual Issues #1, Introduction to Multilingual Issues #2, Multilingual Support, Multilingual Support | |
8761 @section Introduction to Multilingual Issues #1 | |
8762 @cindex introduction to multilingual issues #1 | |
8763 | |
8764 There is an introduction to these issues in the Lisp Reference manual. | |
8765 @xref{Internationalization Terminology,,, lispref, XEmacs Lisp Reference | |
8766 Manual}. Among other documentation that may be of interest to internals | |
8767 programmers is ISO-2022 (@pxref{ISO 2022,,, lispref, XEmacs Lisp | |
8768 Reference Manual}) and CCL (@pxref{CCL,,, lispref, XEmacs Lisp Reference | |
8769 Manual}). | |
8770 | |
8771 @node Introduction to Multilingual Issues #2, Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #1, Multilingual Support | |
8772 @section Introduction to Multilingual Issues #2 | |
8773 @cindex introduction to multilingual issues #2 | |
8774 | |
8775 @subheading Introduction | |
8776 | |
8777 This document covers a number of design issues, problems and proposals | |
8778 with regard to XEmacs MULE. First we present some definitions and | |
8779 some aspects of the design that have been agreed upon. Then we present | |
8780 some issues and problems that need to be addressed, and then I include a | |
8781 proposal of mine to address some of these issues. When there are other | |
8782 proposals, for example from Olivier, these will be appended to the end | |
8783 of this document. | |
8784 | |
8785 @subheading Definitions and Design Basics | |
8786 | |
8787 First, @dfn{text} is defined to be a series of characters which together | |
8788 defines an utterance or partial utterance in some language. | |
8789 Generally, this language is a human language, but it may also be a | |
8790 computer language if the computer language uses a representation close | |
8791 enough to that of human languages for it to also make sense to call its | |
8792 representation text. Text is opposed to @dfn{binary}, which is a sequence | |
8793 of bytes, representing machine-readable but not human-readable data. | |
8794 A @dfn{byte} is merely a number within a predefined range, which nowadays is | |
8795 nearly always zero to 255. A @dfn{character} is a unit of text. What makes | |
8796 one character different from another is not always clear-cut. It is | |
8797 generally related to the appearance of the character, although perhaps | |
8798 not any possible appearance of that character, but some sort of ideal | |
8799 appearance that is assigned to a character. Whether two characters | |
8800 that look very similar are actually the same depends on various | |
8801 factors, some of them political, and on whether the characters are | |
8802 used to mean similar sorts of things or behave similarly in similar | |
8803 contexts. In any case, it is not always clearly defined whether two | |
8804 characters are actually the same or not. In practice, however, this | |
8805 is more or less agreed upon. | |
8806 | |
8807 A @dfn{character set} is just that, a set of one or more characters. | |
8808 The set is unique in that there will not be more than one instance of | |
8809 the same character in a character set, and logically is unordered, | |
8810 although an order is often imposed or suggested for the characters in | |
8811 the character set. We can also define an @dfn{order} on a character | |
8812 set, which is a way of assigning a unique number, or possibly a pair of | |
8813 numbers, or a triplet of numbers, or even a set of four or more numbers | |
8814 to each character in the character set. The combination of a character | |
8815 set and an order on it results in an @dfn{ordered character set}. In an | |
8816 ordered character set, there is an upper limit and a lower limit on the | |
8817 possible values that a character, or that any number within the set of | |
8818 numbers assigned to a character, can take. However, the lower limit | |
8819 does not have to start at zero or one, or anywhere else in particular, | |
8820 nor does the upper limit have to end anywhere particular, and there may | |
8821 be gaps within these ranges such that particular numbers or sets of | |
8822 numbers do not have a corresponding character, even though they are | |
8823 within the upper and lower limits. For example, @dfn{ASCII} defines a | |
8824 very standard ordered character set. It is normally defined to be 94 | |
8825 characters in the range 33 through 126 inclusive on both ends, with | |
8826 every possible character within this range being actually present in the | |
8827 character set. | |
8828 | |
8829 Sometimes the ASCII character set is extended to include what are called | |
8830 @dfn{non-printing characters}. Non-printing characters are characters | |
8831 which instead of really being displayed in a more or less rectangular | |
8832 block, like all other characters, instead indicate certain functions | |
8833 typically related to either control of the display upon which the | |
8834 characters are being displayed, or have some effect on a communications | |
8835 channel that may be currently open and transmitting characters, or may | |
8836 change the meaning of future characters as they are being decoded, or | |
8837 some other similar function. You might say that non-printing characters | |
8838 are somewhat of a hack because they are a special exception to the | |
8839 standard concept of a character as being a printed glyph that has some | |
8840 direct correspondence in the non-computer world. | |
8841 | |
8842 With non-printing characters in mind, the 94-character ordered character | |
8843 set called ASCII is often extended into a 96-character ordered character | |
8844 set, also often called ASCII, which includes in addition to the 94 | |
8845 characters already mentioned, two non-printing characters, one called | |
8846 space and assigned the number 32, just below the bottom of the previous | |
8847 range, and another called @dfn{delete} or @dfn{rubout}, which is given | |
8848 number 127 just above the end of the previous range. Thus to reiterate, | |
8849 the result is a 96-character ordered character set, whose characters | |
8850 take the values from 32 to 127 inclusive. Sometimes ASCII is further | |
8851 extended to contain 32 more non-printing characters, which are given the | |
8852 numbers zero through 31 so that the result is a 128-character ordered | |
8853 character set with characters numbered zero through 127, and with many | |
8854 non-printing characters. Another way to look at this, and the way that | |
8855 is normally taken by XEmacs MULE, is that the characters that would be | |
8856 in the range zero through 31 in the most extended definition of ASCII, | |
8857 instead form their own ordered character set, which is called | |
8858 @dfn{control zero}, and consists of 32 characters in the range zero | |
8859 through 31. A similar ordered character set called @dfn{control one} is | |
8860 also created, and it contains 32 more non-printing characters in the | |
8861 range 128 through 159. Note that none of these three ordered character | |
8862 sets overlaps in any of the numbers that are assigned to their | |
8863 characters, so they can all be used at once. Note further that the same | |
8864 character can occur in more than one character set. This was shown | |
8865 above, for example, in two different ordered character sets we defined, | |
8866 one of which we could have called @dfn{ASCII}, and the other | |
8867 @dfn{ASCII-extended}, to show that it had been extended by two non-printing | |
8868 characters. Most of the characters in these two character sets are | |
8869 shared and present in both of them. | |
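
Because the three ordered character sets described above occupy disjoint ranges, a number can be assigned to its set by range checks alone. The following is a minimal sketch in C, not taken from the XEmacs sources: the names @code{charset_class} and @code{classify_code} are hypothetical, and only the ranges (zero through 31, 32 through 127, 128 through 159) come from the text.

```c
#include <assert.h>

/* Hypothetical names for illustration; only the numeric ranges
   are taken from the surrounding text. */
enum charset_class
{
  CS_CONTROL_0,   /* 32 non-printing characters, 0 through 31 */
  CS_ASCII,       /* 96-character extended ASCII, 32 through 127 */
  CS_CONTROL_1,   /* 32 non-printing characters, 128 through 159 */
  CS_OTHER        /* anything above 159 belongs to some other set */
};

/* Return the ordered character set that owns CODE. */
static enum charset_class
classify_code (int code)
{
  assert (code >= 0);
  if (code <= 31)
    return CS_CONTROL_0;
  if (code <= 127)
    return CS_ASCII;
  if (code <= 159)
    return CS_CONTROL_1;
  return CS_OTHER;
}
```

Since the ranges never overlap, the order of the tests is the only thing that matters, and all three sets can be used in the same text at once.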
8870 | |
8871 Note that there is no restriction on the size of the character set, or | |
8872 on the numbers that are assigned to characters in an ordered character | |
8873 set. It is often extremely useful to represent a sequence of characters | |
8874 as a sequence of bytes, where a byte as defined above is a number in the | |
8875 range zero to 255. An @dfn{encoding} does precisely this. It is simply | |
8876 a mapping from a sequence of characters, possibly augmented with | |
8877 information indicating the character set that each of these characters | |
8878 belongs to, to a sequence of bytes which represents that sequence of | |
8879 characters and no other -- which is to say the mapping is reversible. | |
8880 | |
8881 A @dfn{coding system} is a set of rules for encoding a sequence of | |
8882 characters augmented with character set information into a sequence of | |
8883 bytes, and later performing the reverse operation. It is frequently | |
8884 possible to group coding systems into classes or types based on common | |
8885 features. Typically, for example, a particular coding system class | |
8886 may contain a base coding system which specifies some of the rules, | |
8887 but leaves the rest unspecified. Individual members of the coding | |
8888 system class are formed by starting with the base coding system, and | |
8889 augmenting it with additional rules to produce a particular coding | |
8890 system, what you might think of as a sort of variation within a | |
8891 theme. | |
8892 | |
8893 @subheading XEmacs Specific Definitions | |
8894 | |
8895 First of all, in XEmacs, the concept of character is a little different | |
8896 from the general definition given above. For one thing, the character | |
8897 set that a character belongs to may or may not be an inherent part of | |
8898 the character itself. In other words, the same character occurring in | |
8899 two different character sets may appear in XEmacs as two different | |
8900 characters. This is generally the case now, but we are attempting to | |
8901 move in the other direction. Different proposals may have different | |
8902 ideas about exactly the extent to which this change will be carried out. | |
8903 The general trend, though, is to represent all information about a | |
8904 character other than the character itself, using text properties | |
8905 attached to the character. That way two instances of the same character | |
8906 will look the same to lisp code that merely retrieves the character, and | |
8907 does not also look at the text properties of that character. Everyone | |
8908 involved is in agreement in doing it this way with all Latin characters, | |
8909 and in fact for all characters other than Chinese, Japanese, and Korean | |
8910 ideographs. For those, there may be a difference of opinion. | |
8911 | |
8912 A second difference between the general definition of character and the | |
8913 XEmacs usage of character is that each character is assigned a unique | |
8914 number that distinguishes it from all other characters in the world, or | |
8915 at the very least, from all other characters currently existing anywhere | |
8916 inside the current XEmacs invocation. (If there is a case where the | |
8917 weaker statement applies, but not the stronger statement, it would | |
8918 possibly be with composite characters and any other such characters that | |
8919 are created on the sly.) | |
8920 | |
8921 This unique number is called the @dfn{character representation} of the | |
8922 character, and its particular details are a matter of debate. There is | |
8923 a current standard in use, but it is undoubtedly going to change. | |
8924 What has definitely been agreed upon is that it will be an integer, more | |
8925 specifically a positive integer, represented with less than or equal to | |
8926 31 bits on a 32-bit architecture, and possibly up to 63 bits on a 64-bit | |
8927 architecture, with the proviso that any characters whose | |
8928 representation would fit in a 64-bit architecture, but not on a 32-bit | |
8929 architecture, would be used only for composite characters, and others | |
8930 that would satisfy the weak uniqueness property mentioned above, but not | |
8931 with the strong uniqueness property. | |
8932 | |
8933 At this point, it is useful to talk about the different representations | |
8934 that a sequence of characters can take. The simplest representation is | |
8935 simply as a sequence of characters, and this is called the @dfn{Lisp | |
8936 representation} of text, because it is the representation that Lisp | |
8937 programs see. Other representations include the external | |
8938 representation, which refers to any encoding of the sequence of | |
8939 characters, using the definition of encoding mentioned above. | |
8940 Typically, text in the external representation is used outside of | |
8941 XEmacs, for example in files, e-mail messages, web sites, and the like. | |
8942 Another representation for a sequence of characters is what I will call | |
8943 the @dfn{byte representation}, and it represents the way that XEmacs | |
8944 internally represents text in a buffer, or in a string. Potentially, | |
8945 the representation could be different between a buffer and a string, and | |
8946 then the terms @dfn{buffer byte representation} and @dfn{string byte | |
8947 representation} would be used, but in practice I don't think this will | |
8948 occur. It will be possible, of course, for buffers and strings, or | |
8949 particular buffers and particular strings, to contain different | |
8950 sub-representations of a single representation. For example, Olivier's | |
8951 1-2-4 proposal allows for three sub-representations of his internal byte | |
8952 representation, allowing for 1-byte, 2-byte, and 4-byte width | |
8953 characters respectively. A particular string may be in one | |
8954 sub-representation, and a particular buffer in another | |
8955 sub-representation, but overall both are following the same byte | |
8956 representation. I do not use the term @dfn{internal representation} | |
8957 here, as many people have, because it is potentially ambiguous. | |
8958 | |
8959 Another representation is called the @dfn{array of characters | |
8960 representation}. This is a representation on the C-level in which the | |
8961 sequence of text is represented, not using the byte representation, but | |
8962 by using an array of characters, each represented using the character | |
8963 representation. This sort of representation is often used by redisplay | |
8964 because it is more convenient to work with than any of the other | |
8965 internal representations. | |
8966 | |
8967 The term @dfn{binary representation} may also be heard. Binary | |
8968 representation is used to represent binary data. When binary data is | |
8969 represented in the lisp representation, an equivalence is simply set up | |
8970 between bytes zero through 255, and characters zero through 255. These | |
8971 characters come from four character sets, which are from bottom to top, | |
8972 control zero, ASCII, control one, and Latin-1. Together, they comprise | |
8973 256 characters, and are a good mapping for the 256 possible bytes in a | |
8974 binary representation. Binary representation could also be used to | |
8975 refer to an external representation of the binary data, which is a | |
8976 simple direct byte-to-byte representation. No internal representation | |
8977 should ever be referred to as a binary representation because of | |
8978 ambiguity. The terms character set and coding system were defined | |
8979 generally, above. In XEmacs, the equivalent concepts exist, although | |
8980 character set has been shortened to charset, and in fact represents | |
8981 specifically an ordered character set. For each possible charset, and | |
8982 for each possible coding system, there is an associated object in | |
8983 XEmacs. These objects will be of type charset and coding system, | |
8984 respectively. Charsets and coding systems are divided into classes, or | |
8985 @dfn{types}, the normal term under XEmacs, and all possible charsets and | |
8986 coding systems that may be defined must be in one of these types. If | |
8987 you need to create a charset or coding system that is not one of these | |
8988 types, you will have to modify the C code to support this new type. | |
8989 Some of the existing or soon-to-be-created types are, or will be, | |
8990 generic enough so that this shouldn't be an issue. Note also that the | |
8991 byte encoding for text and the character coding of a character are | |
8992 closely related. You might say that ideally each is the simplest | |
8993 equivalent of the other given the general constraints on each | |
8994 representation. | |
8995 | |
8996 To be specific, in the current MULE representation, | |
7256 | 8997 |
7257 @enumerate | 8998 @enumerate |
7258 @item | 8999 @item |
7259 the @code{staticpro}'ed variables (there is a special | 9000 Characters encode both the character itself and the character set |
7260 @code{staticpro_nodump()} call for protected variables we do not want to | 9001 that it comes from. These character sets are always assumed to be |
7261 dump). | 9002 representable as an ordered character set of size 96 or of size 96 |
7262 | 9003 by 96, or the trivially-related sizes 94 and 94 by 94. The only |
7263 @item | 9004 allowable exceptions are the control zero and control one character |
7264 the Lisp_Object variables registered via @code{dump_add_root_lisp_object} | 9005 sets, which are of size 32. Character sets which do not naturally |
7265 (@code{staticpro()} is equivalent to @code{staticpro_nodump()} + | 9006 have a compatible ordering such as this are shoehorned into an |
7266 @code{dump_add_root_lisp_object()}). | 9007 ordered character set, or possibly two ordered character sets of a |
7267 | 9008 compatible size. |
7268 @item | 9009 @item |
7269 the data-segment memory blocks registered via @code{dump_add_root_block} | 9010 The variable width byte representation was deliberately chosen to |
7270 (for blocks with relocatable pointers), or @code{dump_add_opaque} (for | 9011 allow scanning text forwards and backwards efficiently. This |
7271 "opaque" blocks with no relocatable pointers; this is just a shortcut | 9012 necessitated defining the possible bytes into three ranges which |
7272 for calling @code{dump_add_root_block} with a NULL description). | 9013 we shall call A, B, and C. Range A is used exclusively for |
7273 | 9014 single-byte characters, which is to say characters that are |
7274 @item | 9015 represented using only one contiguous byte. Multi-byte |
7275 the pointer variables registered via @code{dump_add_root_block_ptr}, | 9016 characters are always represented by using one byte from Range B, |
7276 each of which points to a block of heap memory (generally a C structure | 9017 followed by one or more bytes from Range C. What this means is |
7277 or array). Note that @code{dump_add_root_block_ptr} is not technically | 9018 that bytes that begin a character are unequivocally distinguished |
7278 necessary, as a pointer variable can be seen as a special case of a | 9019 from bytes that do not begin a character, and therefore there is |
7279 data-segment memory block and registered using | 9020 never a problem scanning backwards and finding the beginning of a |
7280 @code{dump_add_root_block}. Doing it this way, however, would require | 9021 character. Note that UTF-8 adopts a scheme that is very similar |
7281 another level of static structures declared. Since pointer variables | 9022 in spirit in that it uses separate ranges for the first byte of a |
7282 are quite common, @code{dump_add_root_block_ptr} is provided for | 9023 multi-byte sequence, and the following bytes in a multi-byte |
7283 convenience. Note also that internally we have to treat it separately | 9024 sequence. |
7284 from @code{dump_add_root_block} rather than writing the former as a call | 9025 @item |
7285 to the latter, since we don't have support for creating and using memory | 9026 Given the fact that all ordered character sets allowed were |
7286 descriptions on the fly -- they must all be statically declared in the | 9027 essentially 96 characters per dimension, it made perfect sense to |
7287 data-segment. | 9028 make Range C comprise 96 bytes. With a little more tweaking, the |
9029 currently-standard MULE byte representation was created, and was | |
9030 drafted from this. | |
9031 @item | |
9032 The MULE byte representation defined four basic representations for | |
9033 characters, which would take up from one to four bytes, | |
9034 respectively. The MULE character representation thus had the | |
9035 following constraints: | |
9036 @enumerate | |
9037 @item | |
9038 Character numbers zero through 255 should represent the | |
9039 characters that binary values zero through 255 would be | |
9040 mapped onto. (Note: this was not the case in Kenichi Handa's | |
9041 version of this representation, but I changed it.) | |
9042 @item | |
9043 The four sub-classes of representation in the MULE byte | |
9044 representation should correspond to four contiguous | |
9045 non-overlapping ranges of characters. | |
9046 @item | |
9047 The algorithmic conversion between the single character | |
9048 represented in the byte representation and in the character | |
9049 representation should be as easy as possible. | |
9050 @item | |
9051 Given the previous constraints, the character representation | |
9052 should be as compact as possible, which is to say it should | |
9053 use the least number of bits possible. | |
7288 @end enumerate | 9054 @end enumerate |
7289 | 9055 @end enumerate |
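
The backward-scanning property described in the list above can be sketched in C as follows. This is an illustration under stated assumptions rather than the XEmacs implementation: the function names are hypothetical, and the concrete byte ranges (A as 0x00 through 0x7F, B as 0x80 through 0x9F, C as 0xA0 through 0xFF) are an assumption chosen to agree with the 96-byte Range C mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed range boundaries: A = 0x00..0x7F (single-byte characters),
   B = 0x80..0x9F (first byte of a multi-byte character),
   C = 0xA0..0xFF (96 possible non-first bytes). */
static int
in_range_c (unsigned char b)
{
  return b >= 0xA0;
}

/* Given any byte position POS inside TEXT, return the position of the
   first byte of the character that contains it.  Bytes in Range C never
   begin a character, so we step backwards until we leave Range C. */
static size_t
char_start (const unsigned char *text, size_t pos)
{
  assert (text != NULL);
  while (pos > 0 && in_range_c (text[pos]))
    pos--;
  return pos;
}
```

No state has to be carried while scanning: a single byte is enough to decide whether it begins a character, which is exactly what makes moving backwards through the text cheap.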
7290 This does not include the GCPRO'ed variables, the specbinds, the | 9056 |
7291 catchtags, the backlist, the redisplay or the profiling info, since we | 9057 So you see that the entire structure of the byte and character |
7292 do not want to rebuild the actual chain of lisp calls which end up to | 9058 representations stemmed from a very small number of basic choices, |
7293 the dump-emacs call, only the global variables. | 9059 which were |
7294 | |
7295 Weak lists and weak hash tables are dumped as if they were their | |
7296 non-weak equivalent (without changing their type, of course). This has | |
7297 not yet been a problem. | |
7298 | |
7299 @node Address allocation, The header, Object inventory, Dumping phase | |
7300 @subsection Address allocation | |
7301 @cindex dumping address allocation | |
7302 | |
7303 | |
7304 The next step is to allocate the offsets of each of the objects in the | |
7305 final dump file. This is done by @code{pdump_allocate_offset()} which | |
7306 is called indirectly by @code{pdump_scan_by_alignment()}. | |
7307 | |
7308 The strategy to deal with alignment problems uses these facts: | |
7309 | 9060 |
7310 @enumerate | 9061 @enumerate |
7311 @item | 9062 @item |
7312 real world alignment requirements are powers of two. | 9063 the choice to encode character set information in a character |
7313 | 9064 @item |
7314 @item | 9065 the choice to assume that all character sets would have an order |
7315 the C compiler is required to adjust the size of a struct so that you | 9066 imposed upon them with 96 characters per one or two |
7316 can have an array of them next to each other. This means you can have an | 9067 dimensions. (This is less arbitrary than it seems--it follows |
7317 upper bound of the alignment requirements of a given structure by | 9068 ISO-2022) |
7318 looking at which power of two its size is a multiple. | 9069 @item |
7319 | 9070 the choice to use a variable width byte representation. |
7320 @item | |
7321 the non-variant part of variable size lrecords has an alignment | |
7322 requirement of 4. | |
7323 @end enumerate | 9071 @end enumerate |
7324 | 9072 |
7325 Hence, for each lrecord type, C struct type or opaque data block the | 9073 What this means is that you cannot really separate the byte |
7326 alignment requirement is computed as a power of two, with a minimum of | 9074 representation, the character representation, and the assumptions made |
7327 2^2 for lrecords. @code{pdump_scan_by_alignment()} then scans all the | 9075 about characters and whether they represent character sets from each |
7328 @code{pdump_block_list_elmt}'s, the ones with the highest requirements | 9076 other. All of these are closely intertwined, and for purposes of |
7329 first. This ensures the best packing. | 9077 simplicity, they should be designed together. If you change one |
7330 | 9078 representation without changing another, you are in essence creating a |
7331 The maximum alignment requirement we take into account is 2^8. | 9079 completely new design with its own attendant problems--since your new |
7332 | 9080 design is likely to be quite complex and not very coherent with |
7333 @code{pdump_allocate_offset()} only has to do a linear allocation, | 9081 regards to the translation between the character and byte |
7334 starting at offset 256 (this leaves room for the header and keeps the | 9082 representations, you are likely to run into problems. |
7335 alignments happy). | 9083 |
7336 | 9084 @node Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #4, Introduction to Multilingual Issues #2, Multilingual Support |
7337 @node The header, Data dumping, Address allocation, Dumping phase | 9085 @section Introduction to Multilingual Issues #3 |
7338 @subsection The header | 9086 @cindex introduction to multilingual issues #3 |
7339 @cindex dumping, the header | 9087 |
7340 | 9088 In XEmacs, Mule is a code word for the support for input handling and |
7341 The next step creates the file and writes a header with a signature and | 9089 display of multi-lingual text. This section provides an overview of how |
7342 some random information in it. The @code{reloc_address} field, which | 9090 this support impacts the C and Lisp code in XEmacs. It is important for |
7343 indicates at which address the file should be loaded if we want to avoid | 9091 anyone who works on the C or the Lisp code, especially on the C code, to |
7344 post-reload relocation, is set to 0. It then seeks to offset 256 (base | 9092 be aware of these issues, even if they don't work directly on code that |
7345 offset for the objects). | 9093 implements multi-lingual features, because there are various general |
7346 | 9094 procedures that need to be followed in order to write Mule-compliant |
7347 @node Data dumping, Pointers dumping, The header, Dumping phase | 9095 code. (The specifics of these procedures are documented elsewhere in |
7348 @subsection Data dumping | 9096 this manual.) |
7349 @cindex data dumping | 9097 |
7350 @cindex dumping, data | 9098 There are four primary aspects of Mule support: |
7351 | |
7352 The data is dumped in the same order as the addresses were allocated by | |
7353 @code{pdump_dump_data()}, called from @code{pdump_scan_by_alignment()}. | |
7354 This function copies the data to a temporary buffer, relocates all | |
7355 pointers in the object to the addresses allocated in step Address | |
7356 Allocation, and writes it to the file. Using the same order means that, | |
7357 if we are careful with lrecords whose size is not a multiple of 4, we | |
7358 are ensured that the object is always written at the offset in the file | |
7359 allocated in step Address Allocation. | |
7360 | |
7361 @node Pointers dumping, , Data dumping, Dumping phase | |
7362 @subsection Pointers dumping | |
7363 @cindex pointers dumping | |
7364 @cindex dumping, pointers | |
7365 | |
7366 A bunch of tables needed to reassign properly the global pointers are | |
7367 then written. They are: | |
7368 | 9099 |
7369 @enumerate | 9100 @enumerate |
7370 @item | 9101 @item |
7371 the pdump_root_block_ptrs dynarr | 9102 internal handling and representation of multi-lingual text. |
7372 @item | 9103 @item |
7373 the pdump_opaques dynarr | 9104 conversion between the internal representation of text and the various |
7374 @item | 9105 external representations in which multi-lingual text is encoded, such as |
7375 a vector of all the offsets to the objects in the file that include a | 9106 Unicode representations (including mostly fixed width encodings such as |
7376 description (for faster relocation at reload time) | 9107 UCS-2/UTF-16 and UCS-4 and variable width ASCII conformant encodings, |
7377 @item | 9108 such as UTF-7 and UTF-8); the various ISO2022 representations, which |
7378 the pdump_root_objects and pdump_weak_object_chains dynarrs. | 9109 typically use escape sequences to switch between different character |
9110 sets (such as Compound Text, used under X Windows; JIS, used | |
9111 specifically for encoding Japanese; and EUC, a non-modal encoding used | |
9112 for Japanese, Korean, and certain other languages); Microsoft's | |
9113 multi-byte encodings (such as Shift-JIS); various simple encodings for | |
9114 particular 8-bit character sets (such as Latin-1 and Latin-2, and | |
9115 encodings (such as koi8 and Alternativny) for Cyrillic); and others. | |
9116 This conversion needs to happen both for text in files and text sent to | |
9117 or retrieved from system API calls. It even needs to happen for | |
9118 external binary data because the internal representation does not | |
9119 represent binary data simply as a sequence of bytes as it is represented | |
9120 externally. | |
9121 @item | |
9122 Proper display of multi-lingual characters. | |
9123 @item | |
9124 Input of multi-lingual text using the keyboard. | |
7379 @end enumerate | 9125 @end enumerate |
7380 | 9126 |
7381 For each of the dynarrs we write both the pointer to the variables and | 9127 These four aspects are for the most part independent of each other. |
7382 the relocated offset of the object they point to. Since these variables | 9128 |
7383 are global, the pointers are still valid when restarting the program and | 9129 @subheading Characters, Character Sets, and Encodings |
7384 are used to regenerate the global pointers. | 9130 |
7385 | 9131 A @dfn{character} (which is, BTW, a surprisingly complex concept) is, in |
7386 The @code{pdump_weak_object_chains} dynarr is a special case. The | 9132 a written representation of text, the most basic written unit that has a |
7387 variables it points to are the head of weak linked lists of lisp objects | 9133 meaning of its own. It's comparable to a phoneme when analyzing words |
7388 of the same type. Not all objects of this list are dumped so the | 9134 in spoken speech (for example, the sound of @samp{t} in English, which |
7389 relocated pointer we associate with them points to the first dumped | 9135 in fact has different pronunciations in different words -- aspirated in |
7390 object of the list, or Qnil if none is available. This is also the | 9136 @samp{time}, unaspirated in @samp{stop}, unreleased or even pronounced |
7391 reason why they are not used as roots for the purpose of object | 9137 as a glottal stop in @samp{button}, etc. -- but logically is a single |
7392 enumeration. | 9138 concept). Like a phoneme, a character is an abstract concept defined by |
7393 | 9139 its @emph{meaning}. The character @samp{lowercase f}, for example, can |
7394 Some very important information like the @code{staticpros} and | 9140 always be used to represent the first letter in the word @samp{fill}, |
7395 @code{lrecord_implementations_table} are handled indirectly using | 9141 regardless of whether it's drawn upright or italic, whether the |
7396 @code{dump_add_opaque} or @code{dump_add_root_block_ptr}. | 9142 @samp{fi} combination is drawn as a single ligature, whether there are |
7397 | 9143 serifs on the bottom of the vertical stroke, etc. (These different |
7398 This is the end of the dumping part. | 9144 appearances of a single character are often called @dfn{graphs} or |
7399 | 9145 @dfn{glyphs}.) Our concern when representing text is on representing the |
7400 @node Reloading phase, Remaining issues, Dumping phase, Dumping | 9146 abstract characters, and not on their exact appearance. |
7401 @section Reloading phase | 9147 |
7402 @cindex reloading phase | 9148 A @dfn{character set} (or @dfn{charset}), as we define it, is a set of |
7403 @cindex dumping, reloading phase | 9149 characters, each with an associated number (or set of numbers -- see |
7404 | 9150 below), called a @dfn{code point}. It's important to understand that a |
7405 @subsection File loading | 9151 character is not defined by any number attached to it, but by its |
7406 @cindex dumping, file loading | 9152 meaning. For example, ASCII and EBCDIC are two charsets containing |
7407 | 9153 exactly the same characters (lowercase and uppercase letters, numbers 0 |
7408 The file is mmap'ed in memory (which ensures a PAGESIZE alignment, at | 9154 through 9, particular punctuation marks) but with different |
7409 least 4096), or if mmap is unavailable or fails, a 256-bytes aligned | 9155 numberings. The `comma' character in ASCII and EBCDIC, for instance, is |
7410 malloc is done and the file is loaded. | 9156 the same character despite having a different numbering. Conversely, |
7411 | 9157 when comparing ASCII and JIS-Roman, which look the same except that the |
7412 Some variables are reinitialized from the values found in the header. | 9158 latter has a yen sign substituted for the backslash, we would say that |
7413 | 9159 the backslash and yen sign are @strong{not} the same characters, despite having |
7414 The difference between the actual loading address and the reloc_address | 9160 the same number (92) and despite the fact that all other characters are |
7415 is computed and will be used for all the relocations. | 9161 present in both charsets, with the same numbering. ASCII and JIS-Roman, |
7416 | 9162 then, do @emph{not} have exactly the same characters in them (ASCII has |
7417 | 9163 a backslash character but no yen-sign character, and vice-versa for |
7418 @subsection Putting back the pdump_opaques | 9164 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII |
7419 @cindex dumping, putting back the pdump_opaques | 9165 and JIS-Roman are closer. |
7420 | 9166 |
7421 The memory contents are restored in the obvious and trivial way. | 9167 It's also important to distinguish between charsets and encodings. For |
7422 | 9168 a simple charset like ASCII, there is only one encoding normally used -- |
7423 | 9169 each character is represented by a single byte, with the same value as |
7424 @subsection Putting back the pdump_root_block_ptrs | 9170 its code point. For more complicated charsets, however, things are not |
7425 @cindex dumping, putting back the pdump_root_block_ptrs | 9171 so obvious. Unicode version 2, for example, is a large charset with |
7426 | 9172 thousands of characters, each indexed by a 16-bit number, often |
7427 The variables pointed to by pdump_root_block_ptrs in the dump phase are | 9173 represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One |
7428 reset to the right relocated object addresses. | 9174 obvious encoding uses two bytes per character (actually two encodings, |
7429 | 9175 depending on which of the two possible byte orderings is chosen). This |
7430 | 9176 encoding is convenient for internal processing of Unicode text; however, |
7431 @subsection Object relocation | 9177 it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is |
7432 @cindex dumping, object relocation | 9178 usually used for external text, for example files or e-mail. UTF-8 |
7433 | 9179 represents Unicode characters with one to three bytes (often extended to |
7434 All the objects are relocated using their description and their offset | 9180 six bytes to handle characters with up to 31-bit indices). Unicode |
7435 by @code{pdump_reloc_one}. This step is unnecessary if the | 9181 characters 00 to 7F (identical with ASCII) are directly represented with |
7436 reloc_address is equal to the file loading address. | 9182 one byte, and other characters with two or more bytes, each in the range |
7437 | 9183 80 to FF. |
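
The one-to-three-byte UTF-8 encoding of 16-bit code points described above can be sketched in C as follows. This is a minimal illustration, not code from XEmacs: the function name @code{utf8_encode_16} is hypothetical, the sketch handles only code points up to 0xFFFF, and it makes no attempt to reject surrogate values.

```c
#include <assert.h>
#include <stddef.h>

/* Encode the 16-bit code point CODE as UTF-8 into OUT and return the
   number of bytes written (1, 2, or 3).  Hypothetical helper for
   illustration only. */
static size_t
utf8_encode_16 (unsigned int code, unsigned char out[3])
{
  assert (code <= 0xFFFF);
  if (code <= 0x7F)
    {
      /* 00 to 7F: identical with ASCII, one byte. */
      out[0] = (unsigned char) code;
      return 1;
    }
  if (code <= 0x7FF)
    {
      /* Two bytes, both in the range 80 to FF. */
      out[0] = (unsigned char) (0xC0 | (code >> 6));
      out[1] = (unsigned char) (0x80 | (code & 0x3F));
      return 2;
    }
  /* Three bytes, all in the range 80 to FF. */
  out[0] = (unsigned char) (0xE0 | (code >> 12));
  out[1] = (unsigned char) (0x80 | ((code >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (code & 0x3F));
  return 3;
}
```

For example, the Hebrew letter aleph, 0x05D0, comes out as the two bytes D7 90, while any ASCII character is passed through unchanged as a single byte.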
7438 | 9184 |
7439 @subsection Putting back the pdump_root_objects and pdump_weak_object_chains | 9185 In general, a single encoding may be able to represent more than one |
7440 @cindex dumping, putting back the pdump_root_objects and pdump_weak_object_chains | 9186 charset. |
7441 | 9187 |
7442 Same as Putting back the pdump_root_block_ptrs. | 9188 @subheading Internal Representation of Text |
7443 | 9189 |
7444 | 9190 In an ASCII or single-European-character-set world, life is very simple. |
7445 @subsection Reorganize the hash tables | 9191 There are 256 characters, and each character is represented using the |
7446 @cindex dumping, reorganize the hash tables | 9192 numbers 0 through 255, which fit into a single byte. With a few |
7447 | 9193 exceptions (such as case-changing operations or syntax classes like |
7448 Since some of the hash values in the lisp hash tables are | 9194 'whitespace'), "text" is simply an array of indices into a font. You |
7449 address-dependent, their layout is now wrong. So we go through each of | 9195 can get different languages simply by choosing fonts with different |
7450 them and have them resorted by calling @code{pdump_reorganize_hash_table}. | 9196 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and |
7451 | 9197 everything will "just work" as long as anyone else receiving your text |
7452 @node Remaining issues, , Reloading phase, Dumping | 9198 uses a compatible font. |
7453 @section Remaining issues | 9199 |
7454 @cindex dumping, remaining issues | 9200 In the multi-lingual world, however, it is much more complicated. There |
7455 | 9201 are a great number of different characters which are organized in a |
7456 The build process will have to start a post-dump xemacs, ask it the | 9202 complex fashion into various character sets. The representation to use |
7457 loading address (which will, hopefully, be always the same between | 9203 is not obvious because there are issues of size versus speed to |
7458 different xemacs invocations) [[unfortunately, not true on Linux with | 9204 consider. In fact, there are in general two kinds of representations to |
7459 the ExecShield feature]] and relocate the file to the new address. | 9205 work with: one that represents a single character using an integer |
7460 This way the object relocation phase will not have to be done, which | 9206 (possibly a byte), and the other representing a single character as a |
7461 means no writes in the objects and that, because of the use of mmap, the | 9207 sequence of bytes. The former representation is normally called fixed |
7462 dumped data will be shared between all the xemacs running on the | 9208 width, and the other variable width. Both representations represent |
7463 computer. | 9209 exactly the same characters, and the conversion from one representation |
7464 | 9210 to the other is governed by a specific formula (rather than by table |
7465 Some executable signature will be necessary to ensure that a given dump | 9211 lookup) but it may not be simple. Most C code need not, and in fact |
7466 file is really associated with a given executable, or random crashes | 9212 should not, know the specifics of exactly how the representations work. |
7467 will occur. Maybe a random number set at compile or configure time thru | 9213 In fact, the code must not make assumptions about the representations. |
7468 a define. This will also allow for having differently-compiled xemacsen | 9214 This means in particular that it must use the proper macros for |
7469 on the same system (mule and no-mule comes to mind). | 9215 retrieving the character at a particular memory location, determining |
7470 | 9216 how many characters are present in a particular stretch of text, and |
7471 The DOC file contents should probably end up in the dump file. | 9217 incrementing a pointer to a particular character to point to the |
7472 | 9218 following character, and so on. It must not assume that one character |
7473 | 9219 is stored using one byte, or even using any particular number of bytes. |
7474 @node Events and the Event Loop, Asynchronous Events; Quit Checking, Dumping, Top | 9220 It must not assume that the number of characters in a stretch of text |
9221 bears any particular relation to a number of bytes in that stretch. It | |
9222 must not assume that the character at a particular memory location can | |
9223 be retrieved simply by dereferencing the memory location, even if a | |
9224 character is known to be ASCII or is being compared with an ASCII | |
9225 character, etc. Careful coding is required to be Mule clean. The | |
9226 biggest work of adding Mule support, in fact, is converting all of the | |
9227 existing code to be Mule clean. | |
9228 | |
9229 Lisp code is mostly unaffected by these concerns. Text in strings and | |
9230 buffers appears simply as a sequence of characters regardless of | |
9231 whether Mule support is present. The biggest difference with older | |
9232 versions of Emacs, as well as current versions of GNU Emacs, is that | |
9233 integers and characters are no longer equivalent, but are separate | |
9234 Lisp Object types. | |
9235 | |
9236 @subheading Conversion Between Internal and External Representations | |
9237 | |
9238 All text needs to be converted to an external representation before being | |
9239 sent to a function or file, and all text retrieved from a function or | |
9240 file needs to be converted to the internal representation. This | |
9241 conversion needs to happen as close to the source or destination of the | |
9242 text as possible. No operations should ever be performed on text encoded | |
9243 in an external representation other than simple copying, because no | |
9244 assumptions can reliably be made about the format of this text. You | |
9245 cannot assume, for example, that the end of text is terminated by a null | |
9246 byte. (For example, if the text is Unicode, it will have many null bytes | |
9247 in it.) You cannot find the next "slash" character by searching through | |
9248 the bytes until you find a byte that looks like a "slash" character, | |
9249 because it might actually be the second byte of a Kanji character. | |
9250 Furthermore, all text in the internal representation must be converted, | |
9251 even if it is known to be completely ASCII, because the external | |
9252 representation may not be ASCII compatible (for example, if it is | |
9253 Unicode). | |
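To make the byte-searching hazard concrete, here is a small C sketch. It is not XEmacs code, and the function names are hypothetical. It assumes Shift-JIS as the external encoding and uses the backslash byte 0x5C as the separator, because in Shift-JIS that value can also occur as the second byte of a two-byte character (for example 0x95 0x5C). A naive byte search reports a match inside the character, while a scan that knows the encoding does not. In XEmacs itself the right approach is to convert to the internal representation first rather than to write per-encoding scanners like this one.

```c
#include <assert.h>
#include <stddef.h>

/* Naive search: index of the first byte equal to C, or -1. */
ptrdiff_t
naive_find_byte (const unsigned char *s, size_t len, unsigned char c)
{
  size_t i;
  assert (s != NULL);
  for (i = 0; i < len; i++)
    if (s[i] == c)
      return (ptrdiff_t) i;
  return -1;
}

/* Shift-JIS lead bytes introduce a two-byte character. */
static int
sjis_is_lead (unsigned char b)
{
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Encoding-aware search: skips over two-byte characters so that a
   trail byte is never mistaken for the single-byte character C. */
ptrdiff_t
sjis_find_char (const unsigned char *s, size_t len, unsigned char c)
{
  size_t i = 0;
  assert (s != NULL);
  while (i < len)
    {
      if (sjis_is_lead (s[i]))
        {
          i += 2;               /* skip lead and trail byte together */
          continue;
        }
      if (s[i] == c)
        return (ptrdiff_t) i;
      i++;
    }
  return -1;
}
```

Given the bytes 0x95 0x5C 0x5C 0x61, the naive search stops at index 1 (the middle of a Kanji character), while the aware search finds the real backslash at index 2.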
9254 | |
9255 The place where C code needs to be the most careful is when calling | |
9256 external API functions. It is easy to forget that all text passed to or | |
9257 retrieved from these functions needs to be converted. This includes text | |
9258 in structures passed to or retrieved from these functions and all text | |
9259 that is passed to a callback function that is called by the system. | |
9260 | |
9261 Macros are provided to perform conversions to or from external text. | |
9262 These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT | |
9263 respectively. These macros accept input in various forms, for example, | |
9264 Lisp strings, buffers, lstreams, raw data, and can return data in | |
9265 multiple formats, including both @code{malloc()}ed and @code{alloca()}ed data. The use | |
9266 of @code{alloca()}ed data here is particularly important because, in general, | |
9267 the returned data will not be used after making the API call, and as a | |
9268 result, using @code{alloca()}ed data provides a very cheap and easy to use | |
9269 method of allocation. | |
9270 | |
9271 These macros take a coding system argument which indicates the nature of | |
9272 the external encoding. A coding system is an object that encapsulates | |
9273 the structures of a particular external encoding and the methods required | |
9274 to convert to and from this encoding. A facility exists to create coding | |
9275 system aliases, which in essence gives a single coding system two | |
9276 different names. It is effectively used in XEmacs to provide a layer of | |
9277 abstraction on top of the actual coding systems. For example, the coding | |
9278 system alias "file-name" points to whichever coding system is currently | |
9279 used for encoding and decoding file names as passed to or retrieved from | |
9280 system calls. In general, the actual encoding differs from system to | |
9281 system, and also depends on the particular locale that the user is in. | |
9282 The file-name alias hides that implementation detail behind an abstract | |
9283 interface layer, which provides a unified set of coding systems that is | |
9284 consistent across all operating environments. | |
9285 | |
9286 The choice of which coding system to use in a particular conversion macro | |
9287 requires some thought. In general, you should choose a lower-level | |
9288 actual coding system when the very design of the APIs you are working | |
9289 with calls for that particular coding system. In all other cases, you | |
9290 should find the least general abstract coding system (i.e. coding system | |
9291 alias) that applies to your specific situation. Only use the most | |
9292 general coding systems, such as native, when there is simply nothing else | |
9293 that is more appropriate. By doing things this way, you allow the user | |
9294 more control over how the encoding actually works, because the user is | |
9295 free to map the abstracted coding system names onto different actual | |
9296 coding systems. | |
9297 | |
9298 Some common coding systems are: | |
9299 | |
9300 @table @code | |
9301 @item ctext | |
9302 Compound Text, which is the standard encoding under X Windows, which is | |
9303 used for clipboard data and possibly other data. (ctext is a coding | |
9304 system of type ISO2022.) | |
9305 | |
9306 @item mswindows-unicode | |
9307 this is used for representing text passed to MS Windows API calls with | |
9308 arguments that need to be in Unicode format. (mswindows-unicode is a | |
9309 coding system of type UTF-16) | |
9310 | |
9311 @item mswindows-multi-byte | |
9312 this is used for representing text passed to MS Windows API calls with | |
9313 arguments that need to be in multi-byte format. Note that there are | |
9314 very few if any examples of such calls. | |
9315 | |
9316 @item mswindows-tstr | |
9317 this is used for representing text passed to any MS Windows API calls | |
9318 that declare their argument as LPTSTR, or LPCTSTR. This is the vast | |
9319 majority of system calls and automatically translates either to | |
9320 mswindows-unicode or mswindows-multi-byte, depending on the presence or | |
9321 absence of the UNICODE preprocessor constant. (If we compile XEmacs | |
9322 with this preprocessor constant, then all API calls use Unicode for all | |
9323 text passed to or received from these API calls.) | |
9324 | |
9325 @item terminal | |
9326 used for text sent to or read from a text terminal in the absence of a | |
9327 more specific coding system (calls to window-system specific APIs should | |
9328 use the appropriate window-specific coding system if it makes sense to | |
9329 do so.) | |
9330 | |
9331 @item file-name | |
9332 used when specifying the names of files in the absence of a more | |
9333 specific encoding, such as mswindows-tstr. | |
9334 | |
9335 @item native | |
9336 the most general coding system for specifying text passed to system | |
9337 calls. This generally translates to whatever coding system is specified | |
9338 by the current locale. This should only be used when none of the coding | |
9339 systems mentioned above are appropriate. | |
9340 @end table | |
9341 | |
9342 @subheading Proper Display of Multilingual Text | |
9343 | |
9344 There are two things required to get this working correctly. One is | |
9345 selecting the correct font, and the other is encoding the text according | |
9346 to the encoding used for that specific font, or the window-system | |
9347 specific text display API. Generally, each separate character set has a | |
9348 different font associated with it, which is specified by name, and each | |
9349 font has an associated encoding into which the characters must be | |
9350 translated. (This is the case on X Windows, at least; on Windows there | |
9351 is a more general mechanism.) Both the specific font for a charset and | |
9352 the encoding of that font are system dependent. Currently there is a | |
9353 way of specifying these two properties under X Windows (using the | |
9354 registry and ccl properties of a character set) but not for other window | |
9355 systems. A more general system needs to be implemented to allow these | |
9356 characteristics to be specified for all window systems. | |
9357 | |
9358 Another issue is making sure that the necessary fonts for displaying | |
9359 various character sets are installed on the system. Currently, XEmacs | |
9360 provides, on its web site, X Windows fonts for a number of different | |
9361 character sets that can be installed by users. This isn't done yet for | |
9362 Windows, but it should be. | |
9363 | |
9364 @subheading Inputting of Multilingual Text | |
9365 | |
9366 This is a rather complicated issue because there are many paradigms | |
9367 defined for inputting multi-lingual text, some of which are specific to | |
9368 particular languages, and any particular language may have many | |
9369 different paradigms defined for inputting its text. These paradigms are | |
9370 encoded in input methods and there is a standard API for defining an | |
9371 input method in XEmacs called LEIM, or Library of Emacs Input Methods. | |
9372 Some of these input methods are written entirely in Elisp, and thus are | |
9373 system-independent, while others require the aid either of an external | |
9374 process, or of C level support that ties into a particular | |
9375 system-specific input method API, for example, XIM under X Windows, or | |
9376 the active keyboard layout and IME support under Windows. Currently, | |
9377 there is no support for any system-specific input methods under | |
9378 Microsoft Windows, although this will change. | |
9379 | |
9380 @node Introduction to Multilingual Issues #4, Character Sets, Introduction to Multilingual Issues #3, Multilingual Support | |
9381 @section Introduction to Multilingual Issues #4 | |
9382 @cindex introduction to multilingual issues #4 | |
9383 | |
9384 The rest of the sections in this chapter consist of yet another | |
9385 introduction to multilingual issues, duplicating the information in the | |
9386 previous sections. | |
9387 | |
9388 @node Character Sets, Encodings, Introduction to Multilingual Issues #4, Multilingual Support | |
9389 @section Character Sets | |
9390 @cindex character sets | |
9391 | |
9392 A @dfn{character set} (or @dfn{charset}) is an ordered set of | |
9393 characters. A particular character in a charset is indexed using one or | |
9394 more @dfn{position codes}, which are non-negative integers. The number | |
9395 of position codes needed to identify a particular character in a charset | |
9396 is called the @dfn{dimension} of the charset. In XEmacs/Mule, all | |
9397 charsets have dimension 1 or 2, and the size of all charsets (except for | |
9398 a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range | |
9399 of position codes used to index characters from any of these types of | |
9400 character sets is as follows: | |
9401 | |
9402 @example | |
9403 Charset type            Position code 1         Position code 2 | |
9404 ------------------------------------------------------------ | |
9405 94                      33 - 126                N/A | |
9406 96                      32 - 127                N/A | |
9407 94x94                   33 - 126                33 - 126 | |
9408 96x96                   32 - 127                32 - 127 | |
9409 @end example | |
9410 | |
9411 Note that in the above cases position codes do not start at an | |
9412 expected value such as 0 or 1. The reason for this will become clear | |
9413 later. | |
9414 | |
9415 For example, Latin-1 is a 96-character charset, and JISX0208 (the | |
9416 Japanese national character set) is a 94x94-character charset. | |
9417 | |
9418 [Note that, although the ranges above define the @emph{valid} position | |
9419 codes for a charset, some of the slots in a particular charset may in | |
9420 fact be empty. This is the case for JISX0208, for example, where | |
9421 all the slots whose first position code is in the range 118 - 126 are | |
9422 empty.] | |
9423 | |
9424 There are three charsets that do not follow the above rules. All of | |
9425 them have one dimension, and have ranges of position codes as follows: | |
9426 | |
9427 @example | |
9428 Charset name            Position code 1 | |
9429 ------------------------------------ | |
9430 ASCII                   0 - 127 | |
9431 Control-1               0 - 31 | |
9432 Composite               0 - some large number | |
9433 @end example | |
9434 | |
9435 (The upper bound of the position code for composite characters has not | |
9436 yet been determined, but it will probably be at least 16,383). | |
9437 | |
9438 ASCII is the union of two subsidiary character sets: Printing-ASCII | |
9439 (the printing ASCII character set, consisting of position codes 33 - | |
9440 126, like for a standard 94-character charset) and Control-ASCII (the | |
9441 non-printing characters that would appear in a binary file with codes 0 | |
9442 - 32 and 127). | |
9443 | |
9444 Control-1 contains the non-printing characters that would appear in a | |
9445 binary file with codes 128 - 159. | |
9446 | |
9447 Composite contains characters that are generated by overstriking one | |
9448 or more characters from other charsets. | |
9449 | |
9450 Note that some characters in ASCII, and all characters in Control-1, | |
9451 are @dfn{control} (non-printing) characters. These have no printed | |
9452 representation but instead control some other function of the printing | |
9453 (e.g. TAB, or 9, moves the current character position to the next tab | |
9454 stop). All other characters in all charsets are @dfn{graphic} | |
9455 (printing) characters. | |
9456 | |
9457 When a binary file is read in, the bytes in the file are assigned to | |
9458 character sets as follows: | |
9459 | |
9460 @example | |
9461 Bytes           Character set           Range | |
9462 -------------------------------------------------- | |
9463 0 - 127         ASCII                   0 - 127 | |
9464 128 - 159       Control-1               0 - 31 | |
9465 160 - 255       Latin-1                 32 - 127 | |
9466 @end example | |
9467 | |
9468 This is a bit ad-hoc but gets the job done. | |
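The byte-to-charset assignment in the table above can be written as a small C function. This is a sketch for illustration only; the enum, struct and function names are hypothetical and are not taken from the XEmacs sources.

```c
#include <assert.h>

enum binary_charset { CS_ASCII, CS_CONTROL_1, CS_LATIN_1 };

struct charset_pos
{
  enum binary_charset charset;
  int position_code;
};

/* Map one byte of a binary file (0 - 255) to a charset and a
   position code, following the table above. */
struct charset_pos
classify_binary_byte (int byte)
{
  struct charset_pos result;
  assert (byte >= 0 && byte <= 255);
  if (byte < 128)
    {
      result.charset = CS_ASCII;        /* bytes 0 - 127 */
      result.position_code = byte;      /* range 0 - 127 */
    }
  else if (byte < 160)
    {
      result.charset = CS_CONTROL_1;    /* bytes 128 - 159 */
      result.position_code = byte - 128; /* range 0 - 31 */
    }
  else
    {
      result.charset = CS_LATIN_1;      /* bytes 160 - 255 */
      result.position_code = byte - 128; /* range 32 - 127 */
    }
  return result;
}
```

For example, byte 0xE9 lands in Latin-1 with position code 105, and byte 0x85 lands in Control-1 with position code 5.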
9469 | |
9470 @node Encodings, Internal Mule Encodings, Character Sets, Multilingual Support | |
9471 @section Encodings | |
9472 @cindex encodings, Mule | |
9473 @cindex Mule encodings | |
9474 | |
9475 An @dfn{encoding} is a way of numerically representing characters from | |
9476 one or more character sets. If an encoding only encompasses one | |
9477 character set, then the position codes for the characters in that | |
9478 character set could be used directly. This is not possible, however, if | |
9479 more than one character set is to be used in the encoding. | |
9480 | |
9481 For example, the conversion detailed above between bytes in a binary | |
9482 file and characters is effectively an encoding that encompasses the | |
9483 three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit | |
9484 bytes. | |
9485 | |
9486 Thus, an encoding can be viewed as a way of encoding characters from a | |
9487 specified group of character sets using a stream of bytes, each of which | |
9488 contains a fixed number of bits (but not necessarily 8, as in the common | |
9489 usage of ``byte''). | |
9490 | |
9491 Here are descriptions of a couple of common | |
9492 encodings: | |
9493 | |
9494 @menu | |
9495 * Japanese EUC (Extended Unix Code):: | |
9496 * JIS7:: | |
9497 @end menu | |
9498 | |
9499 @node Japanese EUC (Extended Unix Code), JIS7, Encodings, Encodings | |
9500 @subsection Japanese EUC (Extended Unix Code) | |
9501 @cindex Japanese EUC (Extended Unix Code) | |
9502 @cindex EUC (Extended Unix Code), Japanese | |
9503 @cindex Extended Unix Code, Japanese EUC | |
9504 | |
9505 This encompasses the character sets Printing-ASCII, Katakana-JISX0201 | |
9506 (half-width katakana, the right half of JISX0201), Japanese-JISX0208, | |
9507 and Japanese-JISX0212. | |
9508 | |
9509 Note that Printing-ASCII and Katakana-JISX0201 are 94-character | |
9510 charsets, while Japanese-JISX0208 and Japanese-JISX0212 are | |
9511 94x94-character charsets. | |
9512 | |
9513 The encoding is as follows: | |
9514 | |
9515 @example | |
9516 Character set            Representation (PC=position-code) | |
9517 -------------            -------------- | |
9518 Printing-ASCII           PC1 | |
9519 Katakana-JISX0201        0x8E       | PC1 + 0x80 | |
9520 Japanese-JISX0208        PC1 + 0x80 | PC2 + 0x80 | |
9521 Japanese-JISX0212        0x8F       | PC1 + 0x80 | PC2 + 0x80 | |
9522 @end example | |
9523 | |
9524 Note that there are other versions of EUC for other Asian languages. | |
9525 EUC in general is characterized by | |
9526 | |
9527 @enumerate | |
9528 @item | |
9529 row-column encoding, | |
9530 @item | |
9531 big-endian (row-first) ordering, and | |
9532 @item | |
9533 ASCII compatibility in variable width forms. | |
9534 @end enumerate | |
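As a sketch of the row-column scheme, the following C functions produce the Japanese EUC bytes for a Japanese-JISX0208 character and for a Katakana-JISX0201 character from their position codes. The function names are hypothetical and this is not XEmacs code; it only restates the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a Japanese-JISX0208 character: both position codes get the
   high bit set, row (PC1) first.  Returns the number of bytes written. */
size_t
euc_jp_put_jisx0208 (int pc1, int pc2, unsigned char out[2])
{
  assert (pc1 >= 33 && pc1 <= 126);
  assert (pc2 >= 33 && pc2 <= 126);
  out[0] = (unsigned char) (pc1 + 0x80);
  out[1] = (unsigned char) (pc2 + 0x80);
  return 2;
}

/* Encode a Katakana-JISX0201 character: the byte 0x8E followed by the
   position code with the high bit set. */
size_t
euc_jp_put_katakana (int pc1, unsigned char out[2])
{
  assert (pc1 >= 33 && pc1 <= 126);
  out[0] = 0x8E;
  out[1] = (unsigned char) (pc1 + 0x80);
  return 2;
}
```

For instance, the JISX0208 character with position codes 0x24 and 0x22 (hiragana "a") becomes the bytes 0xA4 0xA2. Printing-ASCII bytes stay below 0x80, so they can never collide with either byte of a two-byte character.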
9535 | |
9536 @node JIS7, , Japanese EUC (Extended Unix Code), Encodings | |
9537 @subsection JIS7 | |
9538 @cindex JIS7 | |
9539 | |
9540 This encompasses the character sets Printing-ASCII, | |
9541 Latin-JISX0201 (the left half of JISX0201; this character set | |
9542 is very similar to Printing-ASCII and is a 94-character charset), | |
9543 Japanese-JISX0208, and Katakana-JISX0201. It uses 7-bit bytes. | |
9544 | |
9545 Unlike EUC, this is a @dfn{modal} encoding, which means that there are | |
9546 multiple states that the encoding can be in, which affect how the bytes | |
9547 are to be interpreted. Special sequences of bytes (called @dfn{escape | |
9548 sequences}) are used to change states. | |
9549 | |
9550 The encoding is as follows: | |
9551 | |
9552 @example | |
9553 Character set            Representation (PC=position-code) | |
9554 -------------            -------------- | |
9555 Printing-ASCII           PC1 | |
9556 Latin-JISX0201           PC1 | |
9557 Katakana-JISX0201        PC1 | |
9558 Japanese-JISX0208        PC1 | PC2 | |
9559 | |
9560 | |
9561 Escape sequence   ASCII equivalent   Meaning | |
9562 ---------------   ----------------   ------- | |
9563 0x1B 0x28 0x4A    ESC ( J            invoke Latin-JISX0201 | |
9564 0x1B 0x28 0x49    ESC ( I            invoke Katakana-JISX0201 | |
9565 0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208 | |
9566 0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII | |
9567 @end example | |
9568 | |
9569 Initially, Printing-ASCII is invoked. | |
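Because JIS7 is modal, a reader has to carry state across bytes. The following C sketch counts the characters in a JIS7 byte stream by tracking the invoked character set. It is illustrative only: the names are hypothetical, only the four escape sequences listed above are recognized, and control characters appearing while Japanese-JISX0208 is invoked are not treated specially.

```c
#include <assert.h>
#include <stddef.h>

enum jis7_state
{
  JIS7_ASCII,                   /* initial state */
  JIS7_LATIN_JISX0201,
  JIS7_KATAKANA_JISX0201,
  JIS7_JISX0208
};

/* Count the characters in a JIS7 stream.  Returns -1 if an escape
   sequence is unknown or the stream ends in the middle of one, or in
   the middle of a two-byte character. */
long
jis7_count_chars (const unsigned char *s, size_t len)
{
  enum jis7_state state = JIS7_ASCII;
  long count = 0;
  size_t i = 0;

  assert (s != NULL || len == 0);
  while (i < len)
    {
      if (s[i] == 0x1B)
        {
          /* Escape sequence: changes state, yields no character. */
          if (len - i < 3)
            return -1;
          if (s[i + 1] == 0x28 && s[i + 2] == 0x4A)
            state = JIS7_LATIN_JISX0201;
          else if (s[i + 1] == 0x28 && s[i + 2] == 0x49)
            state = JIS7_KATAKANA_JISX0201;
          else if (s[i + 1] == 0x24 && s[i + 2] == 0x42)
            state = JIS7_JISX0208;
          else if (s[i + 1] == 0x28 && s[i + 2] == 0x42)
            state = JIS7_ASCII;
          else
            return -1;
          i += 3;
        }
      else if (state == JIS7_JISX0208)
        {
          /* Two position codes per character. */
          if (len - i < 2)
            return -1;
          i += 2;
          count++;
        }
      else
        {
          /* One position code per character. */
          i++;
          count++;
        }
    }
  return count;
}
```

Note that the same byte value means different things depending on the state: 0x24 is "$" while Printing-ASCII is invoked, but the first half of a two-byte character while Japanese-JISX0208 is invoked.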
9570 | |
9571 @node Internal Mule Encodings, Byte/Character Types; Buffer Positions; Other Typedefs, Encodings, Multilingual Support | |
9572 @section Internal Mule Encodings | |
9573 @cindex internal Mule encodings | |
9574 @cindex Mule encodings, internal | |
9575 @cindex encodings, internal Mule | |
9576 | |
9577 In XEmacs/Mule, each character set is assigned a unique number, called a | |
9578 @dfn{leading byte}. This is used in the encodings of a character. | |
9579 Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has | |
9580 a leading byte of 0), although some leading bytes are reserved. | |
9581 | |
9582 Charsets whose leading byte is in the range 0x80 - 0x9F are called | |
9583 @dfn{official} and are used for built-in charsets. Other charsets are | |
9584 called @dfn{private} and have leading bytes in the range 0xA0 - 0xFF; | |
9585 these are user-defined charsets. | |
9586 | |
9587 More specifically: | |
9588 | |
9589 @example | |
9590 Character set                Leading byte | |
9591 -------------                ------------ | |
9592 ASCII                        0 (0x7F in arrays indexed by leading byte) | |
9593 Composite                    0x8D | |
9594 Dimension-1 Official         0x80 - 0x8C/0x8D | |
9595                              (0x8E is free) | |
9596 Control                      0x8F | |
9597 Dimension-2 Official         0x90 - 0x99 | |
9598                              (0x9A - 0x9D are free) | |
9599 Dimension-1 Private Marker   0x9E | |
9600 Dimension-2 Private Marker   0x9F | |
9601 Dimension-1 Private          0xA0 - 0xEF | |
9602 Dimension-2 Private          0xF0 - 0xFF | |
9603 @end example | |
9604 | |
9605 There are two internal encodings for characters in XEmacs/Mule. One is | |
9606 called @dfn{string encoding} and is an 8-bit encoding that is used for | |
9607 representing characters in a buffer or string. It uses 1 to 4 bytes per | |
9608 character. The other is called @dfn{character encoding} and is a 19-bit | |
9609 encoding that is used for representing characters individually in a | |
9610 variable. | |
9611 | |
9612 (In the following descriptions, we'll ignore composite characters for | |
9613 the moment. We also give a general (structural) overview first, | |
9614 followed later by the exact details.) | |
9615 | |
9616 @menu | |
9617 * Internal String Encoding:: | |
9618 * Internal Character Encoding:: | |
9619 @end menu | |
9620 | |
9621 @node Internal String Encoding, Internal Character Encoding, Internal Mule Encodings, Internal Mule Encodings | |
9622 @subsection Internal String Encoding | |
9623 @cindex internal string encoding | |
9624 @cindex string encoding, internal | |
9625 @cindex encoding, internal string | |
9626 | |
9627 ASCII characters are encoded using their position code directly. Other | |
9628 characters are encoded using their leading byte followed by their | |
9629 position code(s) with the high bit set. Characters in private character | |
9630 sets have their leading byte prefixed with a @dfn{leading byte prefix}, | |
9631 which is either 0x9E or 0x9F. (No character sets are ever assigned these | |
9632 leading bytes.) Specifically: | |
9633 | |
9634 @example | |
9635 Character set           Encoding (PC=position-code, LB=leading-byte) | |
9636 -------------           -------- | |
9637 ASCII                                       PC1  | | |
9638 Control-1                         LB   | PC1 + 0xA0 | | |
9639 Dimension-1 official              LB   | PC1 + 0x80 | | |
9640 Dimension-1 private       0x9E |  LB   | PC1 + 0x80 | | |
9641 Dimension-2 official              LB   | PC1 + 0x80 | PC2 + 0x80 | | |
9642 Dimension-2 private       0x9F |  LB   | PC1 + 0x80 | PC2 + 0x80 | |
9643 @end example | |
9644 | |
9645 The basic characteristic of this encoding is that the first byte | |
9646 of all characters is in the range 0x00 - 0x9F, and the second and | |
9647 following bytes of all characters are in the range 0xA0 - 0xFF. | |
9648 This means that it is impossible to get out of sync, or more | |
9649 specifically: | |
9650 | |
9651 @enumerate | |
9652 @item | |
9653 Given any byte position, the beginning of the character it is | |
9654 within can be determined in constant time. | |
9655 @item | |
9656 Given any byte position at the beginning of a character, the | |
9657 beginning of the next character can be determined in constant | |
9658 time. | |
9659 @item | |
9660 Given any byte position at the beginning of a character, the | |
9661 beginning of the previous character can be determined in constant | |
9662 time. | |
9663 @item | |
9664 Textual searches can simply treat encoded strings as if they | |
9665 were encoded in a one-byte-per-character fashion rather than | |
9666 the actual multi-byte encoding. | |
9667 @end enumerate | |
9668 | |
9669 None of the standard non-modal encodings meet all of these | |
9670 conditions. For example, EUC satisfies only (2) and (3), while | |
9671 Shift-JIS and Big5 (not yet described) satisfy only (2). (All | |
9672 non-modal encodings must satisfy (2), in order to be unambiguous.) | |
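The synchronization properties listed above follow directly from the byte ranges, and can be sketched in C as follows. These helpers are hypothetical illustrations, not the actual XEmacs macros; real code should use the macros XEmacs provides rather than testing byte values itself. Each loop runs at most three times, because a character occupies at most four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* A byte below 0xA0 starts a character; 0xA0 - 0xFF continues one. */
static int
is_first_byte (unsigned char b)
{
  return b < 0xA0;
}

/* Property (1): back up from any byte to the start of its character. */
const unsigned char *
char_start (const unsigned char *buf_start, const unsigned char *p)
{
  assert (buf_start != NULL && p >= buf_start);
  while (p > buf_start && !is_first_byte (*p))
    p--;
  return p;
}

/* Property (2): advance from the start of one character to the next. */
const unsigned char *
next_char (const unsigned char *p, const unsigned char *end)
{
  assert (p != NULL && p < end);
  p++;
  while (p < end && !is_first_byte (*p))
    p++;
  return p;
}

/* Count characters in LEN bytes: one per first byte. */
size_t
count_chars (const unsigned char *s, size_t len)
{
  size_t i, n = 0;
  assert (s != NULL || len == 0);
  for (i = 0; i < len; i++)
    if (is_first_byte (s[i]))
      n++;
  return n;
}
```

As an example, take the five bytes 0x61 0x92 0xA4 0xA2 0x62, where 0x92 stands for some dimension-2 official leading byte (the specific value is made up for illustration). The text holds three characters; backing up from the byte at index 3 lands on index 1, and advancing from index 1 lands on index 4.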
9673 | |
9674 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings | |
9675 @subsection Internal Character Encoding | |
9676 @cindex internal character encoding | |
9677 @cindex character encoding, internal | |
9678 @cindex encoding, internal character | |
9679 | |
9680 One 19-bit word represents a single character. The word is | |
9681 separated into three fields: | |
9682 | |
9683 @example | |
9684 Bit number: 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 | |
9685             <------------> <------------------> <------------------> | |
9686 Field:            1                 2                    3 | |
9687 @end example | |
9688 | |
9689 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits. | |
9690 | |
9691 @example | |
9692 Character set          Field 1      Field 2      Field 3 | |
9693 -------------          -------      -------      ------- | |
9694 ASCII                     0            0           PC1 | |
9695    range:                                       (00 - 7F) | |
9696 Control-1                 0            1           PC1 | |
9697    range:                                       (00 - 1F) | |
9698 Dimension-1 official      0        LB - 0x7F       PC1 | |
9699    range:                          (01 - 0D)    (20 - 7F) | |
9700 Dimension-1 private       0        LB - 0x80       PC1 | |
9701    range:                          (20 - 6F)    (20 - 7F) | |
9702 Dimension-2 official  LB - 0x8F       PC1          PC2 | |
9703    range:             (01 - 0A)    (20 - 7F)    (20 - 7F) | |
9704 Dimension-2 private   LB - 0xE1       PC1          PC2 | |
9705    range:             (0F - 1E)    (20 - 7F)    (20 - 7F) | |
9706 Composite               0x1F           ?            ? | |
9707 @end example | |
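The three-field layout can be sketched with a few shifts and masks. The type and function names below are hypothetical and are not the XEmacs macros; the sketch only packs and unpacks the 5-bit, 7-bit and 7-bit fields, and leaves the choice of field values (per the table above) to the caller.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t char_code;     /* only the low 19 bits are used */

/* Pack fields 1, 2 and 3 into a 19-bit character code. */
char_code
make_char_code (unsigned f1, unsigned f2, unsigned f3)
{
  assert (f1 <= 0x1F);          /* field 1: 5 bits, bits 18 - 14 */
  assert (f2 <= 0x7F);          /* field 2: 7 bits, bits 13 - 7 */
  assert (f3 <= 0x7F);          /* field 3: 7 bits, bits 6 - 0 */
  return ((char_code) f1 << 14) | ((char_code) f2 << 7) | f3;
}

unsigned
char_field1 (char_code c)
{
  return (c >> 14) & 0x1F;
}

unsigned
char_field2 (char_code c)
{
  return (c >> 7) & 0x7F;
}

unsigned
char_field3 (char_code c)
{
  return c & 0x7F;
}
```

With this layout, an ASCII character (fields 0, 0, PC1) has a code equal to its position code, and a Control-1 character (fields 0, 1, PC1) has code 128 + PC1, so that position code 5 gives 0x85.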
9708 | |
9709 Note that character codes 0 - 255 are the same as the ``binary | |
9710 encoding'' described above. | |
9711 | |
9712 Most of the code in XEmacs knows nothing of the representation of a | |
9713 character other than that values 0 - 255 represent ASCII, Control 1, | |
9714 and Latin 1. | |
9715 | |
9716 @strong{WARNING WARNING WARNING}: The Boyer-Moore code in | |
9717 @file{search.c}, and the code in @code{search_buffer()} that determines | |
9718 whether that code can be used, knows that ``field 3'' in a character | |
9719 always corresponds to the last byte in the textual representation of the | |
9720 character. (This is important because the Boyer-Moore algorithm works by | |
9721 looking at the last byte of the search string and &&#### finish this. | |
9722 | |
9723 @node Byte/Character Types; Buffer Positions; Other Typedefs, Internal Text API's, Internal Mule Encodings, Multilingual Support | |
9724 @section Byte/Character Types; Buffer Positions; Other Typedefs | |
9725 @cindex byte/character types; buffer positions; other typedefs | |
9726 @cindex byte/character types | |
9727 @cindex character types | |
9728 @cindex buffer positions | |
9729 @cindex typedefs, other | |
9730 | |
9731 @menu | |
9732 * Byte Types:: | |
9733 * Different Ways of Seeing Internal Text:: | |
9734 * Buffer Positions:: | |
9735 * Other Typedefs:: | |
9736 * Usage of the Various Representations:: | |
9737 * Working With the Various Representations:: | |
9738 @end menu | |
9739 | |
9740 @node Byte Types, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs | |
9741 @subsection Byte Types | |
9742 @cindex byte types | |
9743 | |
9744 Stuff pointed to by a char * or unsigned char * will nearly always be | |
9745 one of the following types: | |
9746 | |
9747 @itemize @minus | |
9748 @item | |
9749 a) [Ibyte] pointer to internally-formatted text | |
9750 @item | |
9751 b) [Extbyte] pointer to text in some external format, which can be | |
9752 defined as all formats other than the internal one | |
9753 @item | |
9754 c) [Ascbyte] pure ASCII text | |
9755 @item | |
9756 d) [Binbyte] binary data that is not meant to be interpreted as text | |
9757 @item | |
9758 e) [Rawbyte] general data in memory, where we don't care about whether | |
9759 it's text or binary | |
9760 @item | |
9761 f) [Boolbyte] a zero or a one | |
9762 @item | |
9763 g) [Bitbyte] a byte used for bit fields | |
9764 @item | |
9765 h) [Chbyte] null-semantics @code{char *}; used when casting an argument to | |
9766 an external API where the other types may not be | |
9767 appropriate | |
9768 @end itemize | |
9769 | |
9770 Types (b), (c), (f) and (h) are defined as @code{char}, while the others are | |
9771 @code{unsigned char}. This is for maximum safety (signed characters are | |
9772 dangerous to work with) while maintaining as much compatibility with | |
9773 external API's and string constants as possible. | |
9774 | |
9775 We also provide versions of the above types defined with different | |
9776 underlying C types, for API compatibility. These use the following | |
9777 prefixes: | |
9778 | |
9779 @example | |
9780 C = plain char, when the base type is unsigned | |
9781 U = unsigned | |
9782 S = signed | |
9783 @end example | |
9784 | |
9785 (Formerly I had a comment saying that type (e) "should be replaced with | |
9786 void *". However, there are in fact many places where an unsigned char | |
9787 * might be used -- e.g. for ease in pointer computation, since void * | |
9788 doesn't allow this, and for compatibility with external API's.) | |
9789 | |
9790 Note that these typedefs are purely for documentation purposes; from | |
9791 the C code's perspective, they are exactly equivalent to @code{char *}, | |
9792 @code{unsigned char *}, etc., so you can freely use them with library | |
9793 functions declared as such. | |
9794 | |
9795 Using these more specific types rather than the general ones helps avoid | |
9796 the confusions that occur when the semantics of a char * or unsigned | |
9797 char * argument being studied are unclear. Furthermore, by requiring | |
9798 that ALL uses of @code{char} be replaced with some other type as part of the | |
9799 Mule-ization process, we can use a search for @code{char} as a way of finding | |
9800 code that has not been properly Mule-ized yet. | |
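The signedness rule above can be sketched in C as follows. This is only an illustration, not a copy of the real definitions, which live in the XEmacs headers and may differ in detail. The three prefixed variants shown are examples of the naming scheme, and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the byte typedefs described above (assumed, not copied
   from the XEmacs sources).  Types (b), (c), (f) and (h) are plain
   char; the rest are unsigned char. */
typedef unsigned char Ibyte;    /* (a) internally-formatted text */
typedef char          Extbyte;  /* (b) text in some external format */
typedef char          Ascbyte;  /* (c) pure ASCII text */
typedef unsigned char Binbyte;  /* (d) binary data, not text */
typedef unsigned char Rawbyte;  /* (e) general data in memory */
typedef char          Boolbyte; /* (f) a zero or a one */
typedef unsigned char Bitbyte;  /* (g) a byte used for bit fields */
typedef char          Chbyte;   /* (h) null-semantics char */

/* Examples of the prefixed variants. */
typedef char          CIbyte;   /* C = plain char, base type is unsigned */
typedef unsigned char UExtbyte; /* U = unsigned */
typedef signed char   SExtbyte; /* S = signed */

/* Hypothetical helper: an Ascbyte * is just a char *, so it can be
   handed to a library function without a cast. */
static size_t
ascii_length (const Ascbyte *s)
{
  assert (s != NULL);
  return strlen (s);
}

/* Hypothetical helper: with an unsigned type the comparison is safe,
   because bytes above 0x7f can never appear as negative values. */
static int
is_ascii_byte (Ibyte b)
{
  return b < 0x80;
}
```

Because the typedefs are plain aliases, the compiler does not stop you from mixing them; the benefit is in what the declaration tells the reader.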
9801 | |
9802 @node Different Ways of Seeing Internal Text, Buffer Positions, Byte Types, Byte/Character Types; Buffer Positions; Other Typedefs | |
9803 @subsection Different Ways of Seeing Internal Text | |
9804 @cindex different ways of seeing internal text | |
9805 | |
9806 There are various ways of representing internal text. The two primary | |
9807 ways are as an "array" of individual characters and as a | |
9808 "stream" of bytes. In the ASCII or 8-bit world, where there are at most 256 | |
9809 characters, things are easy because each character fits into a | |
9810 byte. In general, however, this is not true -- see the above discussion | |
9811 of characters vs. encodings. | |
9812 | |
9813 In some cases, it's also important to distinguish between a stream | |
9814 representation as a series of bytes and as a series of textual units. | |
9815 This is particularly important wrt Unicode. The UTF-16 representation | |
9816 (sometimes referred to, rather sloppily, as simply the "Unicode" format) | |
9817 represents text as a series of 16-bit units. Mostly, each unit | |
9818 corresponds to a single character, but not necessarily, as characters | |
9819 outside of the range 0-65535 (the BMP or "Basic Multilingual Plane" of | |
9820 Unicode) require two 16-bit units, through the mechanism of | |
9821 "surrogates". When a series of 16-bit units is serialized into a byte | |
9822 stream, there are at least two possible representations, little-endian | |
9823 and big-endian, and which one is used may depend on the native format of | |
9824 16-bit integers in the CPU of the machine that XEmacs is running | |
9825 on. (Similarly, UTF-32 is logically a representation with 32-bit textual | |
9826 units.) | |
9827 | |
9828 Specifically: | |
9829 | |
9830 @itemize @minus | |
9831 @item | |
9832 UTF-8 has 1-byte (8-bit) units. | |
9833 @item | |
9834 UTF-16 has 2-byte (16-bit) units. | |
9835 @item | |
9836 UTF-32 has 4-byte (32-bit) units. | |
9837 @item | |
9838 XEmacs-internal encoding (the old "Mule" encoding) has 1-byte (8-bit) | |
9839 units. | |
9840 @item | |
9841 UTF-7 technically has 7-bit units that are within the "mail-safe" range | |
9842 (ASCII 32 - 126 plus a few control characters), but normally is encoded | |
9843 in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a | |
9844 normal mode where printable ASCII characters represent themselves and a | |
9845 shifted mode, introduced with a plus sign, where a base-64 encoding is | |
9846 used.) | |
9847 @item | |
9848 UTF-5 technically has 7-bit units (normally encoded in an 8-bit stream, | |
9849 like UTF-7), but only uses uppercase A-V and 0-9, and only encodes 4 | |
9850 bits worth of data per character. UTF-5 is meant for encoding Unicode | |
9851 inside of DNS names. | |
9852 @end itemize | |
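The distinction between characters, textual units and bytes can be made concrete with UTF-16. The following is a minimal sketch, not XEmacs code: @code{utf16_encode} and @code{utf16_serialize} are hypothetical names. The first turns one character into one or two 16-bit units using the standard surrogate arithmetic; the second turns units into a byte stream in either byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 textual units.
   Returns the number of units written (1 or 2). */
static size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  assert (cp <= 0x10FFFF);
  assert (cp < 0xD800 || cp > 0xDFFF);  /* surrogates are not characters */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;           /* inside the BMP: one unit */
      return 1;
    }
  cp -= 0x10000;                        /* 20 bits remain */
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}

/* Serialize N units into bytes, big-endian or little-endian.
   Returns the number of bytes written (always 2 * N). */
static size_t
utf16_serialize (const uint16_t *units, size_t n, int big_endian,
                 unsigned char *bytes)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      unsigned char hi = (unsigned char) (units[i] >> 8);
      unsigned char lo = (unsigned char) (units[i] & 0xFF);
      bytes[2 * i]     = big_endian ? hi : lo;
      bytes[2 * i + 1] = big_endian ? lo : hi;
    }
  return 2 * n;
}
```

A character outside the BMP such as U+1F600 is thus one character, two textual units, and four bytes whose order depends on the chosen endianness.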
9853 | |
9854 Thus, we can imagine three levels in the representation of textual data: | |
9855 | |
9856 @example | |
9857 series of characters -> series of textual units -> series of bytes | |
9858 [Ichar] [Itext] [Ibyte] | |
9859 @end example | |
9860 | |
9861 XEmacs has three corresponding typedefs: | |
9862 | |
9863 @itemize @minus | |
9864 @item | |
9865 An Ichar is an integer (at least 32-bit), representing a 31-bit | |
9866 character. | |
9867 @item | |
9868 An Itext is an unsigned value, either 8, 16 or 32 bits, depending | |
9869 on the nature of the internal representation, and corresponding to | |
9870 a single textual unit. | |
9871 @item | |
9872 An Ibyte is an @code{unsigned char}, representing a single byte in a | |
9873 textual byte stream. | |
9874 @end itemize | |
9875 | |
9876 Internal text in stream format can be simultaneously viewed as either | |
9877 @code{Itext *} or @code{Ibyte *}. The @code{Ibyte *} representation is convenient for | |
9878 copying data from one place to another, because such routines usually | |
9879 expect byte counts. However, @code{Itext *} is much better for actually | |
9880 working with the data. | |
9881 | |
9882 From a text-unit perspective, units 0 through 127 will always be ASCII | |
9883 compatible, and data in Lisp strings (and other textual data generated | |
9884 as a whole, e.g. from external conversion) will be followed by a | |
9885 null-unit terminator. From an @code{Ibyte *} perspective, however, the | |
9886 encoding is only ASCII-compatible if it uses 1-byte units. | |
9887 | |
9888 Similarly to the different text representations, three integral count | |
9889 types exist -- Charcount, Textcount and Bytecount. | |
9890 | |
9891 NOTE: Despite the presence of the terminator, internal text itself can | |
9892 have nulls in it! (Null text units, not just the null bytes present in | |
9893 any UTF-16 encoding.) The terminator is present because in many cases | |
9894 internal text is passed to routines that will ultimately pass the text | |
9895 to library functions that cannot handle embedded nulls, e.g. functions | |
9896 manipulating filenames, and it is a real hassle to have to pass the | |
9897 length around constantly. But this can lead to sloppy coding! We need | |
9898 to be careful about watching for nulls in places that are important, | |
9899 e.g. manipulating string objects or passing data to/from the clipboard. | |
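To see why the terminator alone is not enough, consider the following sketch. It assumes 1-byte textual units and uses a @code{long} for @code{Bytecount}, which is an assumption made for illustration; both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char Ibyte;
typedef long Bytecount;   /* assumption: the real typedef may differ */

/* Count the zero bytes inside the first LEN bytes of TEXT.  This is
   only possible because the caller supplies the length. */
static Bytecount
count_embedded_nulls (const Ibyte *text, Bytecount len)
{
  Bytecount i, n = 0;
  assert (text != NULL && len >= 0);
  for (i = 0; i < len; i++)
    if (text[i] == 0)
      n++;
  return n;
}

/* The length that a routine expecting a null-terminated string would
   see: everything after the first null is silently dropped. */
static Bytecount
length_seen_by_c_library (const Ibyte *text)
{
  assert (text != NULL);
  return (Bytecount) strlen ((const char *) text);
}
```

For the five bytes @code{a b NUL c d} followed by the terminator, the first function reports one embedded null, while the second reports a length of 2. Code that handles string objects or clipboard data therefore has to carry the length explicitly.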
9900 | |
9901 @table @code | |
9902 @item Ibyte | |
9903 The data in a buffer or string is logically made up of Ibyte objects, | |
9904 where an Ibyte takes up the same amount of space as a char. (It is | |
9905 declared differently, though, to catch invalid usages.) Strings stored | |
9906 using Ibytes are said to be in "internal format". The important | |
9907 characteristics of internal format are | |
9908 | |
9909 @itemize @minus | |
9910 @item | |
9911 ASCII characters are represented as a single Ibyte, in the range 0 - | |
9912 0x7f. | |
9913 @item | |
9914 All other characters are represented as an Ibyte in the range 0x80 - 0x9f | |
9915 followed by one or more Ibytes in the range 0xa0 to 0xff. | |
9916 @end itemize | |
9917 | |
9918 This leads to a number of desirable properties: | |
9919 | |
9920 @itemize @minus | |
9921 @item | |
9922 Given the position of the beginning of a character, you can find the | |
9923 beginning of the next or previous character in constant time. | |
9924 @item | |
9925 When searching for a substring or an ASCII character within the string, | |
9926 you need merely use standard searching routines. | |
9927 @end itemize | |
9928 | |
9929 @item Itext | |
9930 | |
9931 #### Document me. | |
9932 | |
9933 @item Ichar | |
9934 This typedef represents a single Emacs character, which can be ASCII, | |
9935 ISO-8859, or some extended character, as would typically be used for | |
9936 Kanji. Note that the representation of a character as an Ichar is @strong{not} | |
9937 the same as the representation of that same character in a string; thus, | |
9938 you cannot do the standard C trick of passing a pointer to a character | |
9939 to a function that expects a string. | |
9940 | |
9941 An Ichar takes up 19 bits of representation and (for code compatibility | |
9942 and such) is compatible with an int. This representation is visible on | |
9943 the Lisp level. The important characteristics of the Ichar | |
9944 representation are | |
9945 | |
9946 @itemize @minus | |
9947 @item | |
9948 values 0x00 - 0x7f represent ASCII. | |
9949 @item | |
9950 values 0x80 - 0xff represent the right half of ISO-8859-1. | |
9951 @item | |
9952 values 0x100 and up represent all other characters. | |
9953 @end itemize | |
9954 | |
9955 This means that Ichar values are upwardly compatible with the standard | |
9956 8-bit representation of ASCII/ISO-8859-1. | |
9957 | |
9958 @item Extbyte | |
9959 Strings that go in or out of Emacs are in "external format", typedef'ed | |
9960 as an array of char or a char *. There is more than one external format | |
9961 (JIS, EUC, etc.) but they all have similar properties. They are modal | |
9962 encodings, which is to say that the meaning of particular bytes is not | |
9963 fixed but depends on what "mode" the string is currently in (e.g. bytes | |
9964 in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or | |
9965 as 2-byte Kanji, depending on the current mode). The mode starts out in | |
9966 ASCII/ISO-8859-1 and is switched using escape sequences -- for example, | |
9967 in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes | |
9968 in the range 0 - 0x7f are interpreted as Kanji characters. | |
9969 | |
9970 External-formatted data is generally desirable for passing data between | |
9971 programs because it is upwardly compatible with standard | |
9972 ASCII/ISO-8859-1 strings and may require less space than internal | |
9973 encodings such as the one described above. In addition, some encodings | |
9974 (e.g. JIS) keep all characters (except the ESC used to switch modes) in | |
9975 the printing ASCII range 0x20 - 0x7e, which results in a much higher | |
9976 probability that the data will avoid being garbled in transmission. | |
9977 Externally-formatted data is generally not very convenient to work with, | |
9978 however, and for this reason is usually converted to internal format | |
9979 before any work is done on the string. | |
9980 | |
9981 NOTE: filenames need to be in external format so that ISO-8859-1 | |
9982 characters come out correctly. | |
9983 @end table | |
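The byte ranges given above for internal format (a first byte below 0xa0, continuation bytes from 0xa0 up) are what make moving between characters cheap. The following sketch relies only on those ranges; the function names are hypothetical and are not the macros defined in @file{text.h}.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char Ibyte;

/* A byte begins a character if it is ASCII (0x00 - 0x7f) or a leading
   byte (0x80 - 0x9f); bytes from 0xa0 up only continue a character. */
static int
ibyte_starts_char (Ibyte b)
{
  return b < 0xA0;
}

/* P points at the beginning of a character in null-terminated text.
   Return the beginning of the next character; the terminator counts
   as a character start, so the scan stops there at the latest. */
static const Ibyte *
next_char (const Ibyte *p)
{
  p++;
  while (!ibyte_starts_char (*p))
    p++;
  return p;
}

/* P points at the beginning of a character (or at the terminator)
   and lies after START.  Return the beginning of the previous
   character. */
static const Ibyte *
prev_char (const Ibyte *start, const Ibyte *p)
{
  assert (p > start);
  p--;
  while (p > start && !ibyte_starts_char (*p))
    p--;
  return p;
}
```

Since a character occupies a small bounded number of bytes, each loop runs a bounded number of times, which is the constant-time property mentioned above. No knowledge of the specific character set is needed to find the boundaries.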
9984 | |
9985 @node Buffer Positions, Other Typedefs, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs | |
9986 @subsection Buffer Positions | |
9987 @cindex buffer positions | |
9988 | |
9989 There are three possible ways to specify positions in a buffer. All | |
9990 of these are one-based: the beginning of the buffer is position or | |
9991 index 1, and 0 is not a valid position. | |
9992 | |
9993 As a "buffer position" (typedef Charbpos): | |
9994 | |
9995 This is an index specifying an offset in characters from the | |
9996 beginning of the buffer. Note that buffer positions are | |
9997 logically @strong{between} characters, not on a character. The | |
9998 difference between two buffer positions specifies the number of | |
9999 characters between those positions. Buffer positions are the | |
10000 only kind of position externally visible to the user. | |
10001 | |
10002 As a "byte index" (typedef Bytebpos): | |
10003 | |
10004 This is an index over the bytes used to represent the characters | |
10005 in the buffer. If there is no Mule support, this is identical | |
10006 to a buffer position, because each character is represented | |
10007 using one byte. However, with Mule support, many characters | |
10008 require two or more bytes for their representation, and so a | |
10009 byte index may be greater than the corresponding buffer | |
10010 position. | |
10011 | |
10012 As a "memory index" (typedef Membpos): | |
10013 | |
10014 This is the byte index adjusted for the gap. For positions | |
10015 before the gap, this is identical to the byte index. For | |
10016 positions after the gap, this is the byte index plus the gap | |
10017 size. There are two possible memory indices for the gap | |
10018 position; the memory index at the beginning of the gap should | |
10019 always be used, except in code that deals with manipulating the | |
10020 gap, where both indices may be seen. The address of the | |
10021 character "at" (i.e. following) a particular position can be | |
10022 obtained from the formula | |
10023 | |
10024 buffer_start_address + memory_index(position) - 1 | |
10025 | |
10026 except in the case of characters at the gap position. | |
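The relation between byte indices, memory indices and addresses can be sketched as follows. The structure and function names are hypothetical, and @code{long} is assumed for the position typedefs; the real buffer structure is considerably richer.

```c
#include <assert.h>
#include <stddef.h>

typedef long Bytebpos;   /* assumption: the real typedefs may differ */
typedef long Membpos;

/* Hypothetical, stripped-down gap buffer. */
struct sketch_buffer
{
  unsigned char *start;  /* buffer_start_address */
  Bytebpos gap_start;    /* byte index of the gap position (1-based) */
  long gap_size;         /* number of unused bytes in the gap */
};

/* Byte index to memory index.  The gap position itself maps to the
   memory index at the beginning of the gap. */
static Membpos
bytebpos_to_membpos (const struct sketch_buffer *b, Bytebpos pos)
{
  assert (pos >= 1);
  return pos > b->gap_start ? pos + b->gap_size : pos;
}

/* Address of the byte "at" (i.e. following) POS.  At the gap position
   the following byte lies after the gap, so the end-of-gap memory
   index has to be used there; this is the exception noted above. */
static unsigned char *
byte_address (const struct sketch_buffer *b, Bytebpos pos)
{
  Membpos mem;
  assert (pos >= 1);
  mem = pos >= b->gap_start ? pos + b->gap_size : pos;
  return b->start + mem - 1;
}
```

With the text @code{ab}, a three-byte gap and then @code{cd}, the gap position is 3: positions 1 and 2 keep their value as memory indices, position 3 maps to memory index 3, and position 4 maps to 7.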
10027 | |
10028 @node Other Typedefs, Usage of the Various Representations, Buffer Positions, Byte/Character Types; Buffer Positions; Other Typedefs | |
10029 @subsection Other Typedefs | |
10030 @cindex other typedefs | |
10031 | |
10032 Charcount: | |
10033 ---------- | |
10034 This typedef represents a count of characters, such as | |
10035 a character offset into a string or the number of | |
10036 characters between two positions in a buffer. The | |
10037 difference between two Charbpos's is a Charcount, and | |
10038 character positions in a string are represented using | |
10039 a Charcount. | |
10040 | |
10041 Textcount: | |
10042 ---------- | |
10043 #### Document me. | |
10044 | |
10045 Bytecount: | |
10046 ---------- | |
10047 Similar to a Charcount but represents a count of bytes. | |
10048 The difference between two Bytebpos's is a Bytecount. | |
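As an illustration of how the two counts relate, the following sketch converts between them for text in the variable-width internal format described earlier (a character starts with a byte below 0xa0 and continues with bytes from 0xa0 up). The function names are hypothetical and @code{long} is an assumed base type for both typedefs.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char Ibyte;
typedef long Charcount;   /* assumption: the real typedefs may differ */
typedef long Bytecount;

/* Number of characters in the first LEN bytes of TEXT: every byte
   below 0xa0 begins exactly one character. */
static Charcount
bytecount_to_charcount (const Ibyte *text, Bytecount len)
{
  Bytecount i;
  Charcount n = 0;
  assert (text != NULL && len >= 0);
  for (i = 0; i < len; i++)
    if (text[i] < 0xA0)
      n++;
  return n;
}

/* Number of bytes occupied by the first N characters of TEXT, which
   is LEN bytes long.  Stops at LEN if there are fewer characters. */
static Bytecount
charcount_to_bytecount (const Ibyte *text, Bytecount len, Charcount n)
{
  Bytecount i = 0;
  assert (text != NULL && len >= 0 && n >= 0);
  while (n > 0 && i < len)
    {
      i++;                                /* the byte that starts the char */
      while (i < len && text[i] >= 0xA0)  /* its continuation bytes */
        i++;
      n--;
    }
  return i;
}
```

Both conversions have to walk the text, which is why code that needs to be fast tends to keep byte counts rather than convert back and forth.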
10049 | |
10050 | |
10051 @node Usage of the Various Representations, Working With the Various Representations, Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs | |
10052 @subsection Usage of the Various Representations | |
10053 @cindex usage of the various representations | |
10054 | |
10055 Memory indices are used in low-level functions in insdel.c and for | |
10056 extent endpoints and marker positions. The reason for this is that | |
10057 this way, the extents and markers don't need to be updated for most | |
10058 insertions, which merely shrink the gap and don't move any | |
10059 characters around in memory. | |
10060 | |
10061 (The beginning-of-gap memory index simplifies insertions w.r.t. | |
10062 markers, because text usually gets inserted after markers. For | |
10063 extents, it is merely for consistency, because text can get | |
10064 inserted either before or after an extent's endpoint depending on | |
10065 the open/closedness of the endpoint.) | |
10066 | |
10067 Byte indices are used in other code that needs to be fast, | |
10068 such as the searching, redisplay, and extent-manipulation code. | |
10069 | |
10070 Buffer positions are used in all other code. This is because this | |
10071 representation is easiest to work with (especially since Lisp | |
10072 code always uses buffer positions), necessitates the fewest | |
10073 changes to existing code, and is the safest (e.g. if the text gets | |
10074 shifted underneath a buffer position, it will still point to a | |
10075 character; if text is shifted under a byte index, it might point | |
10076 to the middle of a character, which would be bad). | |
10077 | |
10078 Similarly, Charcounts are used in all code that deals with strings | |
10079 except for code that needs to be fast, which uses Bytecounts. | |
10080 | |
10081 Strings are always passed around internally using internal format. | |
10082 Conversions to and from external format are performed at the time | |
10083 that the data goes in or out of Emacs. | |
10084 | |
10085 @node Working With the Various Representations, , Usage of the Various Representations, Byte/Character Types; Buffer Positions; Other Typedefs | |
10086 @subsection Working With the Various Representations | |
10087 @cindex working with the various representations | |
10088 | |
10089 We write things this way because it's very important that | |
10090 MAX_BYTEBPOS_GAP_SIZE_3 is a multiple of 3. (As it happens, | |
10091 65535 is a multiple of 3, but this may not always be the | |
10092 case.) #### unfinished | |
10093 | |
10094 @node Internal Text API's, Coding for Mule, Byte/Character Types; Buffer Positions; Other Typedefs, Multilingual Support | |
10095 @section Internal Text API's | |
10096 @cindex internal text API's | |
10097 @cindex text API's, internal | |
10098 @cindex API's, text, internal | |
10099 | |
10100 @strong{NOTE}: The most current documentation for these API's is in | |
10101 @file{text.h}. In case of error, assume that file is correct and this | |
10102 one wrong. | |
10103 | |
10104 @menu | |
10105 * Basic internal-format API's:: | |
10106 * The DFC API:: | |
10107 * The Eistring API:: | |
10108 @end menu | |
10109 | |
10110 @node Basic internal-format API's, The DFC API, Internal Text API's, Internal Text API's | |
10111 @subsection Basic internal-format API's | |
10112 @cindex basic internal-format API's | |
10113 @cindex internal-format API's, basic | |
10114 @cindex API's, basic internal-format | |
10115 | |
10116 These are simple functions and macros to convert between text | |
10117 representation and characters, move forward and back in text, etc. | |
10118 | |
10119 #### Finish the rest of this. | |
10120 | |
10121 Use the following functions/macros on contiguous text in any of the | |
10122 internal formats. Those that take a format arg work on all internal | |
10123 formats; the others work only on the default (variable-width under Mule) | |
10124 format. If the text you're operating on is known to come from a buffer, | |
10125 use the buffer-level functions in buffer.h, which automatically know the | |
10126 correct format and handle the gap. | |
10127 | |
10128 Some terminology: | |
10129 | |
10130 "itext" appearing in the macros means "internal-format text" -- type | |
10131 @code{Ibyte *}. Operations on such pointers themselves, rather than on the | |
10132 text being pointed to, have "itext" instead of "itext" in the macro | |
10133 name. "ichar" in the macro names means an Ichar -- the representation | |
10134 of a character as a single integer rather than a series of bytes, as part | |
10135 of "itext". Many of the macros below are for converting between the | |
10136 two representations of characters. | |
10137 | |
10138 Note also that we try to consistently distinguish between an "Ichar" and | |
10139 a Lisp character. Stuff working with Lisp characters often just says | |
10140 "char", so we consistently use "Ichar" when that's what we're working | |
10141 with. | |
10142 | |
10143 @node The DFC API, The Eistring API, Basic internal-format API's, Internal Text API's | |
10144 @subsection The DFC API | |
10145 @cindex DFC API | |
10146 @cindex API, DFC | |
10147 | |
10148 This is for conversion between internal and external text. Note that | |
10149 there is also the "new DFC" API, which @strong{returns} a pointer to the | |
10150 converted text (in alloca space), rather than storing it into a | |
10151 variable. | |
10152 | |
10153 The macros below are used for converting data between different formats. | |
10154 Generally, the data is textual, and the formats are related to | |
10155 internationalization (e.g. converting between internal-format text and | |
10156 UTF-8) -- but the mechanism is general, and could be used for anything, | |
10157 e.g. decoding gzipped data. | |
10158 | |
10159 In general, conversion involves a source of data, a sink, the existing | |
10160 format of the source data, and the desired format of the sink. The | |
10161 macros below, however, always require that either the source or sink is | |
10162 internal-format text. Therefore, in practice the conversions below | |
10163 involve source, sink, an external format (specified by a coding system), | |
10164 and the direction of conversion (internal->external or vice-versa). | |
10165 | |
10166 Sources and sinks can be raw data (sized or unsized -- when unsized, | |
10167 input data is assumed to be null-terminated [double null-terminated for | |
10168 Unicode-format data], and on output the length is not stored anywhere), | |
10169 Lisp strings, Lisp buffers, lstreams, and opaque data objects. When the | |
10170 output is raw data, the result can be allocated either with @code{alloca()} or | |
10171 @code{malloc()}. (There is currently no provision for writing into a fixed | |
10172 buffer. If you want this, use @code{alloca()} output and then copy the data -- | |
10173 but be careful with the size! Unless you are very sure of the encoding | |
10174 being used, upper bounds for the size are not in general computable.) | |
10175 The obvious restrictions on source and sink types apply (e.g. Lisp | |
10176 strings are a source and sink only for internal data). | |
10177 | |
10178 All raw data outputted will contain an extra null byte (two bytes for | |
10179 Unicode -- currently, in fact, all output data, whether internal or | |
10180 external, is double-null-terminated, but you can't count on this; see | |
10181 below). This means that enough space is allocated to contain the extra | |
10182 nulls; however, these nulls are not reflected in the returned output | |
10183 size. | |
10184 | |
10185 The most basic macros are TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. | |
10186 These can be used to convert between any kinds of sources or sinks. | |
10187 However, 99% of conversions involve raw data or Lisp strings as both | |
10188 source and sink, and usually data is output as @code{alloca()} rather than | |
10189 @code{malloc()}. For this reason, convenience macros are defined for many types | |
10190 of conversions involving raw data and/or Lisp strings, especially when | |
10191 the output is an @code{alloca()}ed string. (When the destination is a | |
10192 Lisp_String, there are other functions that should be used instead -- | |
10193 @code{build_ext_string()} and @code{make_ext_string()}, for example.) The convenience | |
10194 macros are of two types -- the older kind that store the result into a | |
10195 specified variable, and the newer kind that return the result. The newer | |
10196 kind of macros don't exist when the output is sized data, because that | |
10197 would have two return values. NOTE: All convenience macros are | |
10198 ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. | |
10199 Thus, any comments below about the workings of these macros also apply to | |
10200 all convenience macros. | |
10201 | |
10202 @example | |
10203 TO_EXTERNAL_FORMAT (source_type, source, sink_type, sink, codesys) | |
10204 TO_INTERNAL_FORMAT (source_type, source, sink_type, sink, codesys) | |
10205 @end example | |
10206 | |
10207 Typical use is | |
10208 | |
10209 @example | |
10210 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name); | |
10211 @end example | |
10212 | |
10213 which means that the contents of the lisp string @var{str} are written | |
10214 to a malloc'ed memory area which will be pointed to by @var{ptr}, after the | |
10215 function returns. The conversion will be done using the @code{file-name} | |
10216 coding system (which will be controlled by the user indirectly by | |
10217 setting or binding the variable @code{file-name-coding-system}). | |
10218 | |
10219 Some sources and sinks require two C variables to specify. We use | |
10220 some preprocessor magic to allow different source and sink types, and | |
10221 even different numbers of arguments to specify different types of | |
10222 sources and sinks. | |
10223 | |
10224 So we can have a call that looks like | |
10225 | |
10226 @example | |
10227 TO_INTERNAL_FORMAT (DATA, (ptr, len), | |
10228 MALLOC, (ptr, len), | |
10229 coding_system); | |
10230 @end example | |
10231 | |
10232 The parenthesized argument pairs are required to make the | |
10233 preprocessor magic work. | |
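One way such preprocessor magic can work is sketched below. This is a self-contained illustration of the general technique, not the actual definitions in @file{text.h}; every @code{SKETCH_} and @code{sketch_} name is hypothetical. The type keyword is pasted onto a macro name to select an extractor, and a parenthesized pair is taken apart by placing a two-argument macro name in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a conversion source. */
struct sketch_source
{
  const char *ptr;
  size_t len;
};

/* Written directly in front of a parenthesized pair, these pick out
   one element: SKETCH_CAR (p, n) expands to (p). */
#define SKETCH_CAR(x, y) (x)
#define SKETCH_CDR(x, y) (y)

/* One extractor per source type.  For DATA, VAL is the whole pair
   "(ptr, len)", parentheses included; for C_STRING it is a pointer. */
#define SKETCH_SOURCE_DATA(val, out) \
  ((out)->ptr = SKETCH_CAR val, (out)->len = SKETCH_CDR val)
#define SKETCH_SOURCE_C_STRING(val, out) \
  ((out)->ptr = (val), (out)->len = strlen (val))

/* Dispatch on the type keyword by token pasting. */
#define SKETCH_SET_SOURCE(type, val, out) SKETCH_SOURCE_##type (val, out)

/* Example use: the pair travels through the macro as one argument. */
static size_t
sketch_demo_length (const char *text, size_t n)
{
  struct sketch_source src;
  SKETCH_SET_SOURCE (DATA, (text, n), &src);
  assert (src.ptr == text);
  return src.len;
}
```

Without the parentheses, the comma inside the pair would split it into two separate macro arguments, so the argument count would differ between source types. That is why the pairs must be parenthesized.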
10234 | |
10235 NOTE: GC is inhibited during the entire operation of these macros. This | |
10236 is because frequently the data to be converted comes from strings but | |
10237 gets passed in as just DATA, and GC may move around the string data. If | |
10238 we didn't inhibit GC, there'd have to be a lot of messy recoding, | |
10239 alloca-copying of strings and other annoying stuff. | |
10240 | |
10241 The source or sink can be specified in one of these ways: | |
10242 | |
10243 @example | |
10244 DATA, (ptr, len), // input data is a fixed buffer of size len | |
10245 ALLOCA, (ptr, len), // output data is in a @code{ALLOCA()}ed buffer of size len | |
10246 MALLOC, (ptr, len), // output data is in a @code{malloc()}ed buffer of size len | |
10247 C_STRING_ALLOCA, ptr, // equivalent to ALLOCA (ptr, len_ignored) on output | |
10248 C_STRING_MALLOC, ptr, // equivalent to MALLOC (ptr, len_ignored) on output | |
10249 C_STRING, ptr, // equivalent to DATA, (ptr, strlen/wcslen (ptr)) | |
10250 // on input (the Unicode version is used when correct) | |
10251 LISP_STRING, string, // input or output is a Lisp_Object of type string | |
10252 LISP_BUFFER, buffer, // output is written to (point) in lisp buffer | |
10253 LISP_LSTREAM, lstream, // input or output is a Lisp_Object of type lstream | |
10254 LISP_OPAQUE, object, // input or output is a Lisp_Object of type opaque | |
10255 @end example | |
10256 | |
10257 When specifying the sink, use lvalues, since the macro will assign to them, | |
10258 except when the sink is an lstream or a lisp buffer. | |
10259 | |
10260 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the resulting text is | |
10261 stored in a stack-allocated buffer, which is automatically freed on | |
10262 returning from the function. However, the sink types @code{MALLOC} and | |
10263 @code{C_STRING_MALLOC} return @code{xmalloc()}ed memory. The caller is responsible | |
10264 for freeing this memory using @code{xfree()}. | |
10265 | |
10266 The macros accept the kinds of sources and sinks appropriate for | |
10267 internal and external data representation. See the type_checking_assert | |
10268 macros below for the actual allowed types. | |
10269 | |
10270 Since some sources and sinks use one argument (a Lisp_Object) to | |
10271 specify them, while others take a (pointer, length) pair, we use | |
10272 some C preprocessor trickery to allow pair arguments to be specified | |
10273 by parenthesizing them, as in the examples above. | |
10274 | |
10275 Anything prefixed by dfc_ (`data format conversion') is private. | |
10276 They are only used to implement these macros. | |
10277 | |
10278 [[Using C_STRING* is appropriate for using with external APIs that | |
10279 take null-terminated strings. For internal data, we should try to | |
10280 be '\0'-clean - i.e. allow arbitrary data to contain embedded '\0'. | |
10281 | |
10282 Sometime in the future we might allow output to C_STRING_ALLOCA or | |
10283 C_STRING_MALLOC _only_ with @code{TO_EXTERNAL_FORMAT()}, not | |
10284 @code{TO_INTERNAL_FORMAT()}.]] | |
10285 | |
10286 The above comments are not true. Frequently (most of the time, in | |
10287 fact), external strings come as zero-terminated entities, where the | |
10288 zero-termination is the only way to find out the length. Even in | |
10289 cases where you can get the length, most of the time the system will | |
10290 still use the null to signal the end of the string, and there will | |
10291 still be no way to either send in or receive a string with embedded | |
10292 nulls. In such situations, it's pointless to track the length | |
10293 because null bytes can never be in the string. We have a lot of | |
10294 operations that make it easy to operate on zero-terminated strings, | |
10295 and forcing the user to deal with the length everywhere would only | |
10296 make the code uglier and more complicated, for no gain. --ben | |
10297 | |
10298 There is no problem using the same lvalue for source and sink. | |
10299 | |
10300 Also, when pointers are required, the code (currently at least) is | |
10301 lax and allows any pointer types, either in the source or the sink. | |
10302 This makes it possible, e.g., to deal with internal format data held | |
10303 in char *'s or external format data held in WCHAR * (i.e. Unicode). | |
10304 | |
10305 Finally, whenever storage allocation is called for, extra space is | |
10306 allocated for a terminating zero, and such a zero is stored in the | |
10307 appropriate place, regardless of whether the source data was | |
10308 specified using a length or was specified as zero-terminated. This | |
10309 allows you to freely pass the resulting data, no matter how | |
10310 obtained, to a routine that expects zero termination (modulo, of | |
10311 course, that any embedded zeros in the resulting text will cause | |
10312 truncation). In fact, currently two terminating zeros are allocated | |
10313 and stored after the data result. This is to allow for the | |
10314 possibility of storing a Unicode value on output, which needs the | |
10315 two zeros. Currently, however, the two zeros are stored regardless | |
10316 of whether the conversion is internal or external and regardless of | |
10317 whether the external coding system is in fact Unicode. This | |
10318 behavior may change in the future, and you cannot rely on this -- | |
10319 the most you can rely on is that sink data in Unicode format will | |
10320 have two terminating nulls, which combine to form one Unicode null | |
10321 character. | |
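The termination and ownership rules for a malloc-style sink can be illustrated with a stand-in that performs no real conversion. @code{sketch_convert_malloc} is a hypothetical function that merely copies its input; it uses plain @code{malloc()} where the real macros are documented to use @code{xmalloc()}.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a conversion with a malloc()-style sink.
   The "conversion" is a plain copy.  Two zero bytes are stored after
   the data, but they are not counted in *OUT_LEN.  The caller owns
   the returned memory and must free it.  Returns NULL on allocation
   failure. */
static unsigned char *
sketch_convert_malloc (const unsigned char *src, size_t len,
                       size_t *out_len)
{
  unsigned char *dst;
  assert (src != NULL && out_len != NULL);
  dst = (unsigned char *) malloc (len + 2);  /* room for two zeros */
  if (dst == NULL)
    return NULL;
  memcpy (dst, src, len);
  dst[len] = 0;       /* enough to terminate 1-byte-unit text */
  dst[len + 1] = 0;   /* together with the first: a 16-bit null */
  *out_len = len;
  return dst;
}
```

The result can be passed to routines that expect zero termination whether or not the source had a length, subject to the truncation caveat for embedded zeros noted above.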
10322 | |
10323 NOTE: You might ask, why are these not written as functions that | |
10324 @strong{RETURN} the converted string, since that would allow them to be used | |
10325 much more conveniently, without having to constantly declare temporary | |
10326 variables? The answer is that in fact I originally did write the | |
10327 routines that way, but that required either | |
10328 | |
10329 @itemize @bullet | |
10330 @item | |
10331 (a) calling @code{alloca()} inside of a function call, or | |
10332 @item | |
10333 (b) using expressions separated by commas and a global temporary variable, or | |
10334 @item | |
10335 (c) using the GCC extension (@{ ... @}). | |
10336 @end itemize | |
10337 | |
10338 Turned out that all of the above had bugs, all caused by GCC (hence the | |
10339 comments about "those GCC wankers" and "ream gcc up the ass"). As for | |
10340 (a), some versions of GCC (especially on Intel platforms) had | |
10341 buggy implementations of @code{alloca()} that couldn't handle being called | |
10342 inside of a function call -- they just decremented the stack right in the | |
10343 middle of pushing args. Oops, crash with stack trashing, very bad. (b) | |
10344 was an attempt to fix (a), and that led to further GCC crashes, esp. when | |
10345 you had two such calls in a single subexpression, because GCC couldn't be | |
10346 counted upon to follow even a minimally reasonable order of execution. | |
10347 True, you can't count on one argument being evaluated before another, but | |
10348 GCC would actually interleave them so that the temp var got stomped on by | |
10349 one while the other was accessing it. So I tried (c), which was | |
10350 problematic because that GCC extension has more bugs in it than a | |
10351 termite's nest. | |
10352 | |
10353 So reluctantly I converted to the current way. Now, that was a while ago | |
10354 (c. 1994), and it appears that the bug involving alloca in function calls | |
10355 has long since been fixed. More recently, I defined the new-dfc routines | |
10356 down below, which DO allow exactly such convenience of returning your | |
10357 args rather than store them in temp variables, and I also wrote a | |
10358 configure check to see whether @code{alloca()} causes crashes inside of function | |
10359 calls, and if so use the portable @code{alloca()} implementation in alloca.c. | |
10360 If you define TEST_NEW_DFC, the old routines get written in terms of the | |
10361 new ones, and I've had a beta put out with this on, and | |
10362 this appears to cause no problems -- so we should consider | |
10363 switching, and feel no compunctions about writing further such function- | |
10364 like @code{alloca()} routines in lieu of statement-like ones. --ben | |
10365 | |
10366 @node The Eistring API, , The DFC API, Internal Text API's | |
10367 @subsection The Eistring API | |
10368 @cindex Eistring API | |
10369 @cindex API, Eistring | |
10370 | |
10371 (This API is currently under-used.) When doing simple things with | |
10372 internal text, the basic internal-format API's are enough. But to do | |
10373 things like delete or replace a substring, concatenate various strings, | |
10374 etc. is difficult to do cleanly because of the allocation issues. | |
10375 The Eistring API is designed to deal with this, and provides a clean | |
10376 way of modifying and building up internal text. (Note that the former | |
10377 lack of this API has meant that some code uses Lisp strings to do | |
10378 similar manipulations, resulting in excess garbage and increased | |
10379 garbage collection.) | |
10380 | |
10381 NOTE: The Eistring API is (or should be) Mule-correct even without | |
10382 an ASCII-compatible internal representation. | |
10383 | |
10384 @example | |
10385 #### NOTE: This is a work in progress. Neither the API nor especially | |
10386 the implementation is finished. | |
10387 | |
10388 NOTE: An Eistring is a structure that makes it easy to work with | |
10389 internally-formatted strings of data. It provides operations similar | |
10390 in feel to the standard @code{strcpy()}, @code{strcat()}, @code{strlen()}, etc., but | |
10391 | |
10392 (a) it is Mule-correct | |
10393 (b) it does dynamic allocation so you never have to worry about size | |
10394 restrictions | |
10395 (c) it comes in an @code{ALLOCA()} variety (all allocation is stack-local, | |
10396 so there is no need to explicitly clean up) as well as a @code{malloc()} | |
10397 variety | |
10398 (d) it knows its own length, so it does not suffer from standard null | |
10399 byte brain-damage -- but it null-terminates the data anyway, so | |
10400 it can be passed to standard routines | |
10401 (e) it provides a much more powerful set of operations and knows about | |
10402 all the standard places where string data might reside: Lisp_Objects, | |
10403 other Eistrings, Ibyte * data with or without an explicit length, | |
10404 ASCII strings, Ichars, etc. | |
10405 (f) it provides easy operations to convert to/from externally-formatted | |
10406 data, and is easier to use than the standard TO_INTERNAL_FORMAT | |
10407 and TO_EXTERNAL_FORMAT macros. (An Eistring can store both the internal | |
10408 and external version of its data, but the external version is only | |
10409 initialized or changed when you call @code{eito_external()}.) | |
10410 | |
10411 The idea is to make it as easy to write Mule-correct string manipulation | |
10412 code as it is to write normal string manipulation code. We also make | |
10413 the API sufficiently general that it can handle multiple internal data | |
10414 formats (e.g. some fixed-width optimizing formats and a default variable | |
10415 width format) and allows for @strong{ANY} data format we might choose in the | |
10416 future for the default format, including UCS2. (In other words, we can't | |
10417 assume that the internal format is ASCII-compatible and we can't assume | |
10418 it doesn't have embedded null bytes. We do assume, however, that any | |
10419 chosen format will have the concept of null-termination.) All of this is | |
10420 hidden from the user. | |
10421 | |
10422 #### It is really too bad that we don't have a real object-oriented | |
10423 language, or at least a language with polymorphism! | |
10424 | |
10425 | |
10426 ********************************************** | |
10427 * Declaration * | |
10428 ********************************************** | |
10429 | |
10430 To declare an Eistring, either put one of the following in the local | |
10431 variable section: | |
10432 | |
10433 DECLARE_EISTRING (name); | |
10434 Declare a new Eistring and initialize it to the empty string. This | |
10435 is a standard local variable declaration and can go anywhere in the | |
10436 variable declaration section. NAME itself is declared as an | |
10437 Eistring *, and its storage declared on the stack. | |
10438 | |
10439 DECLARE_EISTRING_MALLOC (name); | |
10440 Declare and initialize a new Eistring, which uses @code{malloc()}ed | |
10441 instead of @code{ALLOCA()}ed data. This is a standard local variable | |
10442 declaration and can go anywhere in the variable declaration | |
10443 section. Once you initialize the Eistring, you will have to free | |
10444 it using @code{eifree()} to avoid memory leaks. You will need to use this | |
10445 form if you are passing an Eistring to any function that modifies | |
10446 it (otherwise, the modified data may be in stack space and get | |
10447 overwritten when the function returns). | |
10448 | |
10449 or use | |
10450 | |
10451 Eistring ei; | |
10452 void eiinit (Eistring *ei); | |
10453 void eiinit_malloc (Eistring *einame); | |
10454 If you need to put an Eistring elsewhere than in a local variable | |
10455 declaration (e.g. in a structure), declare it as shown and then | |
10456 call one of the init macros. | |
10457 | |
10458 Also note: | |
10459 | |
10460 void eifree (Eistring *ei); | |
10461 If you declared an Eistring to use @code{malloc()} to hold its data, | |
10462 or converted it to the heap using @code{eito_malloc()}, then this | |
10463 releases any data in it and afterwards resets the Eistring | |
10464 using @code{eiinit_malloc()}. Otherwise, it just resets the Eistring | |
10465 using @code{eiinit()}. | |
10466 | |
10467 | |
10468 ********************************************** | |
10469 * Conventions * | |
10470 ********************************************** | |
10471 | |
10472 - The names of the functions have been chosen, where possible, to | |
10473 match the names of @code{str*()} functions in the standard C API. | |
10474 - | |
10475 | |
10476 | |
10477 ********************************************** | |
10478 * Initialization * | |
10479 ********************************************** | |
10480 | |
10481 void eireset (Eistring *eistr); | |
10482 Initialize the Eistring to the empty string. | |
10483 | |
10484 void eicpy_* (Eistring *eistr, ...); | |
10485 Initialize the Eistring from somewhere: | |
10486 | |
10487 void eicpy_ei (Eistring *eistr, Eistring *eistr2); | |
10488 ... from another Eistring. | |
10489 void eicpy_lstr (Eistring *eistr, Lisp_Object lisp_string); | |
10490 ... from a Lisp_Object string. | |
10491 void eicpy_ch (Eistring *eistr, Ichar ch); | |
10492 ... from an Ichar (this can be a conventional C character). | |
10493 | |
10494 void eicpy_lstr_off (Eistring *eistr, Lisp_Object lisp_string, | |
10495 Bytecount off, Charcount charoff, | |
10496 Bytecount len, Charcount charlen); | |
10497 ... from a section of a Lisp_Object string. | |
10498 void eicpy_lbuf (Eistring *eistr, Lisp_Object lisp_buf, | |
10499 Bytecount off, Charcount charoff, | |
10500 Bytecount len, Charcount charlen); | |
10501 ... from a section of a Lisp_Object buffer. | |
10502 void eicpy_raw (Eistring *eistr, const Ibyte *data, Bytecount len); | |
10503 ... from raw internal-format data in the default internal format. | |
10504 void eicpy_rawz (Eistring *eistr, const Ibyte *data); | |
10505 ... from raw internal-format data in the default internal format | |
10506 that is "null-terminated" (the meaning of this depends on the nature | |
10507 of the default internal format). | |
10508 void eicpy_raw_fmt (Eistring *eistr, const Ibyte *data, Bytecount len, | |
10509 Internal_Format intfmt, Lisp_Object object); | |
10510 ... from raw internal-format data in the specified format. | |
10511 void eicpy_rawz_fmt (Eistring *eistr, const Ibyte *data, | |
10512 Internal_Format intfmt, Lisp_Object object); | |
10513 ... from raw internal-format data in the specified format that is | |
10514 "null-terminated" (the meaning of this depends on the nature of | |
10515 the specific format). | |
10516 void eicpy_c (Eistring *eistr, const Ascbyte *c_string); | |
10517 ... from an ASCII null-terminated string. Non-ASCII characters in | |
10518 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
10519 void eicpy_c_len (Eistring *eistr, const Ascbyte *c_string, len); | |
10520 ... from an ASCII string, with length specified. Non-ASCII characters | |
10521 in the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
10522 void eicpy_ext (Eistring *eistr, const Extbyte *extdata, | |
10523 Lisp_Object codesys); | |
10524 ... from external null-terminated data, with coding system specified. | |
10525 void eicpy_ext_len (Eistring *eistr, const Extbyte *extdata, | |
10526 Bytecount extlen, Lisp_Object codesys); | |
10527 ... from external data, with length and coding system specified. | |
10528 void eicpy_lstream (Eistring *eistr, Lisp_Object lstream); | |
10529 ... from an lstream; reads data till eof. Data must be in default | |
10530 internal format; otherwise, interpose a decoding lstream. | |
10531 | |
10532 | |
10533 ********************************************** | |
10534 * Getting the data out of the Eistring * | |
10535 ********************************************** | |
10536 | |
10537 Ibyte *eidata (Eistring *eistr); | |
10538 Return a pointer to the raw data in an Eistring. This is NOT | |
10539 a copy. | |
10540 | |
10541 Lisp_Object eimake_string (Eistring *eistr); | |
10542 Make a Lisp string out of the Eistring. | |
10543 | |
10544 Lisp_Object eimake_string_off (Eistring *eistr, | |
10545 Bytecount off, Charcount charoff, | |
10546 Bytecount len, Charcount charlen); | |
10547 Make a Lisp string out of a section of the Eistring. | |
10548 | |
10549 void eicpyout_alloca (Eistring *eistr, LVALUE: Ibyte *ptr_out, | |
10550 LVALUE: Bytecount len_out); | |
10551 Make an @code{ALLOCA()} copy of the data in the Eistring, using the | |
10552 default internal format. Due to the nature of @code{ALLOCA()}, this | |
10553 must be a macro, with all lvalues passed in as parameters. | |
10554 (More specifically, not all compilers correctly handle using | |
10555 @code{ALLOCA()} as the argument to a function call -- GCC on x86 | |
10556 didn't use to, for example.) A pointer to the @code{ALLOCA()}ed data | |
10557 is stored in PTR_OUT, and the length of the data (not including | |
10558 the terminating zero) is stored in LEN_OUT. | |
10559 | |
10560 void eicpyout_alloca_fmt (Eistring *eistr, LVALUE: Ibyte *ptr_out, | |
10561 LVALUE: Bytecount len_out, | |
10562 Internal_Format intfmt, Lisp_Object object); | |
10563 Like @code{eicpyout_alloca()}, but converts to the specified internal | |
10564 format. (No formats other than FORMAT_DEFAULT are currently | |
10565 implemented, and you get an assertion failure if you try.) | |
10566 | |
10567 Ibyte *eicpyout_malloc (Eistring *eistr, Bytecount *intlen_out); | |
10568 Make a @code{malloc()} copy of the data in the Eistring, using the | |
10569 default internal format. This is a real function. No lvalues | |
10570 passed in. Returns the new data, and stores the length (not | |
10571 including the terminating zero) using INTLEN_OUT, unless it's | |
10572 a NULL pointer. | |
10573 | |
10574 Ibyte *eicpyout_malloc_fmt (Eistring *eistr, Internal_Format intfmt, | |
10575 Bytecount *intlen_out, Lisp_Object object); | |
10576 Like @code{eicpyout_malloc()}, but converts to the specified internal | |
10577 format. (No formats other than FORMAT_DEFAULT are currently | |
10578 implemented, and you get an assertion failure if you try.) | |
10579 | |
10580 | |
10581 ********************************************** | |
10582 * Moving to the heap * | |
10583 ********************************************** | |
10584 | |
10585 void eito_malloc (Eistring *eistr); | |
10586 Move this Eistring to the heap. Its data will be stored in a | |
10587 @code{malloc()}ed block rather than the stack. Subsequent changes to | |
10588 this Eistring will @code{realloc()} the block as necessary. Use this | |
10589 when you want the Eistring to remain in scope past the end of | |
10590 this function call. You will have to manually free the data | |
10591 in the Eistring using @code{eifree()}. | |
10592 | |
10593 void eito_alloca (Eistring *eistr); | |
10594 Move this Eistring back to the stack, if it was moved to the | |
10595 heap with @code{eito_malloc()}. This will automatically free any | |
10596 heap-allocated data. | |
10597 | |
10598 | |
10599 | |
10600 ********************************************** | |
10601 * Retrieving the length * | |
10602 ********************************************** | |
10603 | |
10604 Bytecount eilen (Eistring *eistr); | |
10605 Return the length of the internal data, in bytes. See also | |
10606 @code{eiextlen()}, below. | |
10607 Charcount eicharlen (Eistring *eistr); | |
10608 Return the length of the internal data, in characters. | |
10609 | |
10610 | |
10611 ********************************************** | |
10612 * Working with positions * | |
10613 ********************************************** | |
10614 | |
10615 Bytecount eicharpos_to_bytepos (Eistring *eistr, Charcount charpos); | |
10616 Convert a char offset to a byte offset. | |
10617 Charcount eibytepos_to_charpos (Eistring *eistr, Bytecount bytepos); | |
10618 Convert a byte offset to a char offset. | |
10619 Bytecount eiincpos (Eistring *eistr, Bytecount bytepos); | |
10620 Increment the given position by one character. | |
10621 Bytecount eiincpos_n (Eistring *eistr, Bytecount bytepos, Charcount n); | |
10622 Increment the given position by N characters. | |
10623 Bytecount eidecpos (Eistring *eistr, Bytecount bytepos); | |
10624 Decrement the given position by one character. | |
10625 Bytecount eidecpos_n (Eistring *eistr, Bytecount bytepos, Charcount n); | |
10626 Decrement the given position by N characters. | |
10627 | |
10628 | |
10629 ********************************************** | |
10630 * Getting the character at a position * | |
10631 ********************************************** | |
10632 | |
10633 Ichar eigetch (Eistring *eistr, Bytecount bytepos); | |
10634 Return the character at a particular byte offset. | |
10635 Ichar eigetch_char (Eistring *eistr, Charcount charpos); | |
10636 Return the character at a particular character offset. | |
10637 | |
10638 | |
10639 ********************************************** | |
10640 * Setting the character at a position * | |
10641 ********************************************** | |
10642 | |
10643 Ichar eisetch (Eistring *eistr, Bytecount bytepos, Ichar chr); | |
10644 Set the character at a particular byte offset. | |
10645 Ichar eisetch_char (Eistring *eistr, Charcount charpos, Ichar chr); | |
10646 Set the character at a particular character offset. | |
10647 | |
10648 | |
10649 ********************************************** | |
10650 * Concatenation * | |
10651 ********************************************** | |
10652 | |
10653 void eicat_* (Eistring *eistr, ...); | |
10654 Concatenate onto the end of the Eistring, with data coming from the | |
10655 same places as above: | |
10656 | |
10657 void eicat_ei (Eistring *eistr, Eistring *eistr2); | |
10658 ... from another Eistring. | |
10659 void eicat_c (Eistring *eistr, Ascbyte *c_string); | |
10660 ... from an ASCII null-terminated string. Non-ASCII characters in | |
10661 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
10662 void eicat_raw (Eistring *eistr, const Ibyte *data, Bytecount len); | |
10663 ... from raw internal-format data in the default internal format. | |
10664 void eicat_rawz (Eistring *eistr, const Ibyte *data); | |
10665 ... from raw internal-format data in the default internal format | |
10666 that is "null-terminated" (the meaning of this depends on the nature | |
10667 of the default internal format). | |
10668 void eicat_lstr (Eistring *eistr, Lisp_Object lisp_string); | |
10669 ... from a Lisp_Object string. | |
10670 void eicat_ch (Eistring *eistr, Ichar ch); | |
10671 ... from an Ichar. | |
10672 | |
10673 (All except the first variety are convenience functions. | |
10674 In the general case, create another Eistring from the source.) | |
10675 | |
10676 | |
10677 ********************************************** | |
10678 * Replacement * | |
10679 ********************************************** | |
10680 | |
10681 void eisub_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
10682 Bytecount len, Charcount charlen, ...); | |
10683 Replace a section of the Eistring, specifically: | |
10684 | |
10685 void eisub_ei (Eistring *eistr, Bytecount off, Charcount charoff, | |
10686 Bytecount len, Charcount charlen, Eistring *eistr2); | |
10687 ... with another Eistring. | |
10688 void eisub_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
10689 Bytecount len, Charcount charlen, Ascbyte *c_string); | |
10690 ... with an ASCII null-terminated string. Non-ASCII characters in | |
10691 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
10692 void eisub_ch (Eistring *eistr, Bytecount off, Charcount charoff, | |
10693 Bytecount len, Charcount charlen, Ichar ch); | |
10694 ... with an Ichar. | |
10695 | |
10696 void eidel (Eistring *eistr, Bytecount off, Charcount charoff, | |
10697 Bytecount len, Charcount charlen); | |
10698 Delete a section of the Eistring. | |
10699 | |
10700 | |
10701 ********************************************** | |
10702 * Converting to an external format * | |
10703 ********************************************** | |
10704 | |
10705 void eito_external (Eistring *eistr, Lisp_Object codesys); | |
10706 Convert the Eistring to an external format and store the result | |
10707 in the string. NOTE: Further changes to the Eistring will @strong{NOT} | |
10708 change the external data stored in the string. You will have to | |
10709 call @code{eito_external()} again in such a case if you want the external | |
10710 data. | |
10711 | |
10712 Extbyte *eiextdata (Eistring *eistr); | |
10713 Return a pointer to the external data stored in the Eistring as | |
10714 a result of a prior call to @code{eito_external()}. | |
10715 | |
10716 Bytecount eiextlen (Eistring *eistr); | |
10717 Return the length in bytes of the external data stored in the | |
10718 Eistring as a result of a prior call to @code{eito_external()}. | |
10719 | |
10720 | |
10721 ********************************************** | |
10722 * Searching in the Eistring for a character * | |
10723 ********************************************** | |
10724 | |
10725 Bytecount eichr (Eistring *eistr, Ichar chr); | |
10726 Charcount eichr_char (Eistring *eistr, Ichar chr); | |
10727 Bytecount eichr_off (Eistring *eistr, Ichar chr, Bytecount off, | |
10728 Charcount charoff); | |
10729 Charcount eichr_off_char (Eistring *eistr, Ichar chr, Bytecount off, | |
10730 Charcount charoff); | |
10731 Bytecount eirchr (Eistring *eistr, Ichar chr); | |
10732 Charcount eirchr_char (Eistring *eistr, Ichar chr); | |
10733 Bytecount eirchr_off (Eistring *eistr, Ichar chr, Bytecount off, | |
10734 Charcount charoff); | |
10735 Charcount eirchr_off_char (Eistring *eistr, Ichar chr, Bytecount off, | |
10736 Charcount charoff); | |
10737 | |
10738 | |
10739 ********************************************** | |
10740 * Searching in the Eistring for a string * | |
10741 ********************************************** | |
10742 | |
10743 Bytecount eistr_ei (Eistring *eistr, Eistring *eistr2); | |
10744 Charcount eistr_ei_char (Eistring *eistr, Eistring *eistr2); | |
10745 Bytecount eistr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off, | |
10746 Charcount charoff); | |
10747 Charcount eistr_ei_off_char (Eistring *eistr, Eistring *eistr2, | |
10748 Bytecount off, Charcount charoff); | |
10749 Bytecount eirstr_ei (Eistring *eistr, Eistring *eistr2); | |
10750 Charcount eirstr_ei_char (Eistring *eistr, Eistring *eistr2); | |
10751 Bytecount eirstr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off, | |
10752 Charcount charoff); | |
10753 Charcount eirstr_ei_off_char (Eistring *eistr, Eistring *eistr2, | |
10754 Bytecount off, Charcount charoff); | |
10755 | |
10756 Bytecount eistr_c (Eistring *eistr, Ascbyte *c_string); | |
10757 Charcount eistr_c_char (Eistring *eistr, Ascbyte *c_string); | |
10758 Bytecount eistr_c_off (Eistring *eistr, Ascbyte *c_string, Bytecount off, | |
10759 Charcount charoff); | |
10760 Charcount eistr_c_off_char (Eistring *eistr, Ascbyte *c_string, | |
10761 Bytecount off, Charcount charoff); | |
10762 Bytecount eirstr_c (Eistring *eistr, Ascbyte *c_string); | |
10763 Charcount eirstr_c_char (Eistring *eistr, Ascbyte *c_string); | |
10764 Bytecount eirstr_c_off (Eistring *eistr, Ascbyte *c_string, | |
10765 Bytecount off, Charcount charoff); | |
10766 Charcount eirstr_c_off_char (Eistring *eistr, Ascbyte *c_string, | |
10767 Bytecount off, Charcount charoff); | |
10768 | |
10769 | |
10770 ********************************************** | |
10771 * Comparison * | |
10772 ********************************************** | |
10773 | |
10774 int eicmp_* (Eistring *eistr, ...); | |
10775 int eicmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
10776 Bytecount len, Charcount charlen, ...); | |
10777 int eicasecmp_* (Eistring *eistr, ...); | |
10778 int eicasecmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
10779 Bytecount len, Charcount charlen, ...); | |
10780 int eicasecmp_i18n_* (Eistring *eistr, ...); | |
10781 int eicasecmp_i18n_off_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
10782 Bytecount len, Charcount charlen, ...); | |
10783 | |
10784 Compare the Eistring with the other data. Return value same as | |
10785 from strcmp. The @code{*} is either @code{ei} for another Eistring (in | |
10786 which case @code{...} is an Eistring), or @code{c} for a pure-ASCII string | |
10787 (in which case @code{...} is a pointer to that string). For anything | |
10788 more complex, first create an Eistring out of the source. | |
10789 Comparison is either simple (@code{eicmp_...}), ASCII case-folding | |
10790 (@code{eicasecmp_...}), or multilingual case-folding | |
10791 (@code{eicasecmp_i18n_...}). | |
10792 | |
10793 | |
10794 More specifically, the prototypes are: | |
10795 | |
10796 int eicmp_ei (Eistring *eistr, Eistring *eistr2); | |
10797 int eicmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff, | |
10798 Bytecount len, Charcount charlen, Eistring *eistr2); | |
10799 int eicasecmp_ei (Eistring *eistr, Eistring *eistr2); | |
10800 int eicasecmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff, | |
10801 Bytecount len, Charcount charlen, Eistring *eistr2); | |
10802 int eicasecmp_i18n_ei (Eistring *eistr, Eistring *eistr2); | |
10803 int eicasecmp_i18n_off_ei (Eistring *eistr, Bytecount off, | |
10804 Charcount charoff, Bytecount len, | |
10805 Charcount charlen, Eistring *eistr2); | |
10806 | |
10807 int eicmp_c (Eistring *eistr, Ascbyte *c_string); | |
10808 int eicmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
10809 Bytecount len, Charcount charlen, Ascbyte *c_string); | |
10810 int eicasecmp_c (Eistring *eistr, Ascbyte *c_string); | |
10811 int eicasecmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
10812 Bytecount len, Charcount charlen, | |
10813 Ascbyte *c_string); | |
10814 int eicasecmp_i18n_c (Eistring *eistr, Ascbyte *c_string); | |
10815 int eicasecmp_i18n_off_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
10816 Bytecount len, Charcount charlen, | |
10817 Ascbyte *c_string); | |
10818 | |
10819 | |
10820 ********************************************** | |
10821 * Case-changing the Eistring * | |
10822 ********************************************** | |
10823 | |
10824 void eilwr (Eistring *eistr); | |
10825 Convert all characters in the Eistring to lowercase. | |
10826 void eiupr (Eistring *eistr); | |
10827 Convert all characters in the Eistring to uppercase. | |
10828 @end example | |
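
The core idea behind points (b) and (d) above (dynamic allocation, plus a stored length with a terminator kept anyway) can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: @code{Toystr} and the @code{toy_*} functions are hypothetical names invented here, they work on raw bytes only, and they are not the Eistring implementation, which also has to be Mule-correct and support an @code{ALLOCA()} variety.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy type for illustration; NOT the real Eistring. */
typedef struct
{
  unsigned char *data;  /* always kept null-terminated */
  size_t len;           /* length in bytes, excluding the terminator */
  size_t cap;           /* allocated size in bytes */
} Toystr;

/* Start out as the empty string, on the heap. */
static void
toy_init (Toystr *s)
{
  s->cap = 16;
  s->data = malloc (s->cap);
  if (!s->data)
    abort ();
  s->data[0] = '\0';
  s->len = 0;
}

/* Make sure there is room for NEED bytes plus the terminator. */
static void
toy_reserve (Toystr *s, size_t need)
{
  if (need + 1 <= s->cap)
    return;
  while (s->cap < need + 1)
    s->cap *= 2;
  s->data = realloc (s->data, s->cap);
  if (!s->data)
    abort ();
}

/* Append N raw bytes; the caller never worries about size limits. */
static void
toy_cat_raw (Toystr *s, const void *src, size_t n)
{
  toy_reserve (s, s->len + n);
  memcpy (s->data + s->len, src, n);
  s->len += n;
  s->data[s->len] = '\0';
}

/* Append a null-terminated C string. */
static void
toy_cat_c (Toystr *s, const char *c_string)
{
  toy_cat_raw (s, c_string, strlen (c_string));
}

/* Delete N bytes starting at byte offset OFF. */
static void
toy_del (Toystr *s, size_t off, size_t n)
{
  assert (off <= s->len && n <= s->len - off);
  /* The "+ 1" moves the terminator along with the tail. */
  memmove (s->data + off, s->data + off + n, s->len - off - n + 1);
  s->len -= n;
}

/* Release the heap block, as a malloc()-variety string requires. */
static void
toy_free (Toystr *s)
{
  free (s->data);
  s->data = NULL;
  s->len = s->cap = 0;
}
```

A caller would pair @code{toy_init()} with @code{toy_free()}, much as the @code{malloc()} variety of Eistring pairs its declaration with @code{eifree()}. Because the length is stored, embedded zero bytes would not confuse @code{toy_cat_raw()} or @code{toy_del()}; the terminator is maintained only so the data can still be handed to standard routines.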
10829 | |
10830 @node Coding for Mule, CCL, Internal Text API's, Multilingual Support | |
10831 @section Coding for Mule | |
10832 @cindex coding for Mule | |
10833 @cindex Mule, coding for | |
10834 | |
10835 Although Mule support is not compiled by default in XEmacs, many people | |
10836 are using it, and we consider it crucial that new code works correctly | |
10837 with multibyte characters. This is not hard; it is only a matter of | |
10838 following several simple user-interface guidelines. Even if you never | |
10839 compile with Mule, with a little practice you will find it quite easy | |
10840 to code Mule-correctly. | |
10841 | |
10842 Note that these guidelines are not necessarily tied to the current Mule | |
10843 implementation; they are also a good idea to follow on the grounds of | |
10844 code generalization for future I18N work. | |
10845 | |
10846 @menu | |
10847 * Character-Related Data Types:: | |
10848 * Working With Character and Byte Positions:: | |
10849 * Conversion to and from External Data:: | |
10850 * General Guidelines for Writing Mule-Aware Code:: | |
10851 * An Example of Mule-Aware Code:: | |
10852 * Mule-izing Code:: | |
10853 @end menu | |
10854 | |
10855 @node Character-Related Data Types, Working With Character and Byte Positions, Coding for Mule, Coding for Mule | |
10856 @subsection Character-Related Data Types | |
10857 @cindex character-related data types | |
10858 @cindex data types, character-related | |
10859 | |
10860 First, let's review the basic character-related datatypes used by | |
10861 XEmacs. Note that some of the separate @code{typedef}s are not | |
10862 mandatory, but they improve clarity of code a great deal, because one | |
10863 glance at the declaration can tell the intended use of the variable. | |
10864 | |
10865 @table @code | |
10866 @item Ichar | |
10867 @cindex Ichar | |
10868 An @code{Ichar} holds a single Emacs character. | |
10869 | |
10870 Obviously, the equality between characters and bytes is lost in the Mule | |
10871 world. Characters can be represented by one or more bytes in the | |
10872 buffer, and @code{Ichar} is a C type large enough to hold any | |
10873 character. (This currently isn't quite true for ISO 10646, which | |
10874 defines a character as a 31-bit non-negative quantity, while XEmacs | |
10875 characters are only 30 bits. This is irrelevant, unless you are | |
10876 considering using the ISO 10646 private groups to support really large | |
10877 private character sets---in particular, the Mule character set!---in | |
10878 a version of XEmacs using Unicode internally.) | |
10879 | |
10880 Without Mule support, an @code{Ichar} is equivalent to an | |
10881 @code{unsigned char}. [[This doesn't seem to be true; @file{lisp.h} | |
10882 unconditionally @samp{typedef}s @code{Ichar} to @code{int}.]] | |
10883 | |
10884 @item Ibyte | |
10885 @cindex Ibyte | |
10886 The data representing the text in a buffer or string is logically a set | |
10887 of @code{Ibyte}s. | |
10888 | |
10889 XEmacs does not work with the same character formats all the time; when | |
10890 reading characters from the outside, it decodes them to an internal | |
10891 format, and likewise encodes them when writing. @code{Ibyte} (in fact | |
10892 @code{unsigned char}) is the basic unit of the internal format used for | |
10893 XEmacs buffers and strings. An @code{Ibyte *} is the type that points at text | |
10894 encoded in the variable-width internal encoding. | |
10895 | |
10896 One character can correspond to one or more @code{Ibyte}s. In the | |
10897 current Mule implementation, an ASCII character is represented by the | |
10898 @code{Ibyte} of the same value, and other characters are represented by a sequence | |
10899 of two or more @code{Ibyte}s. (This will also be true of an | |
10900 implementation using UTF-8 as the internal encoding. In fact, only code | |
10901 that implements character code conversions and a very few macros used to | |
10902 implement motion by whole characters will notice the difference between | |
10903 UTF-8 and the Mule encoding.) | |
10904 | |
10905 Without Mule support, there are exactly 256 characters, implicitly | |
10906 Latin-1, and each character is represented using one @code{Ibyte}, and | |
10907 there is a one-to-one correspondence between @code{Ibyte}s and | |
10908 @code{Ichar}s. | |
10909 | |
10910 @item Charxpos | |
10911 @itemx Charbpos | |
10912 @itemx Charcount | |
10913 @cindex Charxpos | |
10914 @cindex Charbpos | |
10915 @cindex Charcount | |
10916 A @code{Charbpos} represents a character position in a buffer. A | |
10917 @code{Charcount} represents a number (count) of characters. Logically, | |
10918 subtracting two @code{Charbpos} values yields a @code{Charcount} value. | |
10919 When representing a character position in a string, we just use | |
10920 @code{Charcount} directly. The reason for having a separate typedef for | |
10921 buffer positions is that they are 1-based, whereas string positions are | |
10922 0-based and hence string counts and positions can be freely intermixed (a | |
10923 string position is equivalent to the count of characters from the | |
10924 beginning). When representing a character position that could be either | |
10925 in a buffer or string (for example, in the extent code), @code{Charxpos} | |
10926 is used. Although all of these are @code{typedef}ed to | |
10927 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make | |
10928 it clear what sort of position is being used. | |
10929 | |
10930 @code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the | |
10931 only ones that are ever visible to Lisp. | |
10932 | |
10933 @item Bytexpos | |
10934 @itemx Bytecount | |
10935 @cindex Bytebpos | |
10936 @cindex Bytecount | |
10937 A @code{Bytebpos} represents a byte position in a buffer. A | |
10938 @code{Bytecount} represents the distance between two positions, in | |
10939 bytes. Byte positions in strings use @code{Bytecount}, and for byte | |
10940 positions that can be either in a buffer or string, @code{Bytexpos} is | |
10941 used. The relationship between @code{Bytexpos}, @code{Bytebpos} and | |
10942 @code{Bytecount} is the same as the relationship between | |
10943 @code{Charxpos}, @code{Charbpos} and @code{Charcount}. | |
10944 | |
10945 @item Extbyte | |
10946 @cindex Extbyte | |
10947 When dealing with the outside world, XEmacs works with @code{Extbyte}s, | |
10948 which are equivalent to @code{char}. The distance between two | |
10949 @code{Extbyte}s is a @code{Bytecount}, since external text is a | |
10950 byte-by-byte encoding. Extbytes occur mainly at the transition point | |
10951 between internal text and external functions. XEmacs code should not, | |
10952 if it can possibly avoid it, do any actual manipulation using external | |
10953 text, since its format is completely unpredictable (it might not even be | |
10954 ASCII-compatible). | |
10955 @end table | |
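
To make the distinction between character counts and byte counts concrete, here is a small sketch of converting between the two in a variable-width encoding. It rests on assumptions that should be stated plainly: the typedefs are local stand-ins (the text above says the real position and count types are @code{EMACS_INT}), UTF-8 is used in place of the Mule encoding because its continuation bytes are easy to recognize, and @code{starts_char}, @code{bytes_to_chars} and @code{chars_to_bytes} are hypothetical helpers, not XEmacs functions.

```c
#include <assert.h>

/* Stand-in typedefs for illustration only. */
typedef unsigned char Ibyte;
typedef long Charcount;
typedef long Bytecount;

/* Nonzero if B begins a character, i.e. is not a UTF-8
   continuation byte of the form 10xxxxxx. */
static int
starts_char (Ibyte b)
{
  return (b & 0xC0) != 0x80;
}

/* Number of characters contained in the first LEN bytes of TEXT. */
static Charcount
bytes_to_chars (const Ibyte *text, Bytecount len)
{
  Charcount n = 0;
  Bytecount i;

  for (i = 0; i < len; i++)
    if (starts_char (text[i]))
      n++;
  return n;
}

/* Byte offset of the character at 0-based character offset CHARPOS
   in TEXT, whose total length is LEN bytes.  Returns LEN if CHARPOS
   is at or past the end. */
static Bytecount
chars_to_bytes (const Ibyte *text, Bytecount len, Charcount charpos)
{
  Bytecount i = 0;

  while (i < len && charpos > 0)
    {
      i++;                                   /* step over the lead byte */
      while (i < len && !starts_char (text[i]))
        i++;                                 /* and any continuation bytes */
      charpos--;
    }
  return i;
}
```

Both helpers use 0-based string offsets, which is why a count and a position can share one type. A buffer position would be the same value plus one, which is the reason the text gives for keeping @code{Charbpos} and @code{Bytebpos} as separate typedefs.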
10956 | |
10957 @node Working With Character and Byte Positions, Conversion to and from External Data, Character-Related Data Types, Coding for Mule | |
10958 @subsection Working With Character and Byte Positions | |
10959 @cindex character and byte positions, working with | |
10960 @cindex byte positions, working with character and | |
10961 @cindex positions, working with character and byte | |
10962 | |
10963 Now that we have defined the basic character-related types, we can look | |
10964 at the macros and functions designed for work with them and for | |
10965 conversion between them. Most of these macros are defined in | |
10966 @file{buffer.h}, and we don't discuss all of them here, but only the | |
10967 most important ones. Examining the existing code is the best way to | |
10968 learn about them. | |
10969 | |
10970 @table @code | |
10971 @item MAX_ICHAR_LEN | |
10972 @cindex MAX_ICHAR_LEN | |
10973 This preprocessor constant is the maximum number of buffer bytes needed to | |
10974 represent an Emacs character in the variable width internal encoding. | |
10975 It is useful when allocating temporary strings to keep a known number of | |
10976 characters. For instance: | |
10977 | |
10978 @example | |
10979 @group | |
10980 @{ | |
10981   Charcount cclen; | |
10982   ... | |
10983   @{ | |
10984     /* Allocate space for @var{cclen} characters. */ | |
10985     Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN); | |
10986     ... | |
10987 @end group | |
10988 @end example | |
10989 | |
10990 If you followed the previous section, you can guess that, logically, | |
10991 multiplying a @code{Charcount} value with @code{MAX_ICHAR_LEN} produces | |
10992 a @code{Bytecount} value. | |
10993 | |
10994 In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4. | |
10995 Without Mule, it is 1. In a mature Unicode-based XEmacs, it will also | |
10996 be 4 (since all Unicode characters can be encoded in UTF-8 in 4 bytes or | |
10997 less), but some versions may use up to 6, in order to use the large | |
10998 private space provided by ISO 10646 to ``mirror'' the Mule code space. | |
10999 | |
11000 @item itext_ichar | |
11001 @itemx set_itext_ichar | |
11002 @cindex itext_ichar | |
11003 @cindex set_itext_ichar | |
11004 The @code{itext_ichar} macro takes an @code{Ibyte} pointer and | |
11005 returns the @code{Ichar} stored at that position. If it were a | |
11006 function, its prototype would be: | |
11007 | |
11008 @example | |
11009 Ichar itext_ichar (Ibyte *p); | |
11010 @end example | |
11011 | |
11012 @code{set_itext_ichar} stores an @code{Ichar} to the specified byte | |
11013 position. It returns the number of bytes stored: | |
11014 | |
11015 @example | |
11016 Bytecount set_itext_ichar (Ibyte *p, Ichar c); | |
11017 @end example | |
11018 | |
11019 It is important to note that @code{set_itext_ichar} is safe only for | |
11020 appending a character at the end of a buffer, not for overwriting a | |
11021 character in the middle. This is because the width of characters | |
11022 varies, and @code{set_itext_ichar} cannot resize the string if it | |
11023 writes, say, a two-byte character where a single-byte character used to | |
11024 reside. | |
11025 | |
11026 A typical use of @code{set_itext_ichar} can be demonstrated by this | |
11027 example, which copies characters from buffer @var{buf} to a temporary | |
11028 string of Ibytes, written through the @code{Ibyte} pointer @code{p}. | |
11029 | |
11030 @example | |
11031 @group | |
11032 @{ | |
11033   Charbpos pos; | |
11034   for (pos = beg; pos < end; pos++) | |
11035     @{ | |
11036       Ichar c = BUF_FETCH_CHAR (buf, pos); | |
11037       p += set_itext_ichar (p, c); | |
11038     @} | |
11039 @} | |
11040 @end group | |
11041 @end example | |
11042 | |
11043 Note how @code{set_itext_ichar} is used to store the @code{Ichar} | |
11044 and advance the pointer at the same time. | |
11045 | |
11046 @item INC_IBYTEPTR | |
11047 @itemx DEC_IBYTEPTR | |
11048 @cindex INC_IBYTEPTR | |
11049 @cindex DEC_IBYTEPTR | |
11050 These two macros increment and decrement an @code{Ibyte} pointer, | |
11051 respectively. They will adjust the pointer by the appropriate number of | |
11052 bytes according to the byte length of the character stored there. Both | |
11053 macros assume that the memory address is located at the beginning of a | |
11054 valid character. | |
11055 | |
11056 Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)} | |
11057 simply expand to @code{p++} and @code{p--}, respectively. | |
11058 | |
11059 @item bytecount_to_charcount | |
11060 @cindex bytecount_to_charcount | |
11061 Given a pointer to a text string and a length in bytes, return the | |
11062 equivalent length in characters. | |
11063 | |
11064 @example | |
11065 Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc); | |
11066 @end example | |
11067 | |
11068 @item charcount_to_bytecount | |
11069 @cindex charcount_to_bytecount | |
11070 Given a pointer to a text string and a length in characters, return the | |
11071 equivalent length in bytes. | |
11072 | |
11073 @example | |
11074 Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc); | |
11075 @end example | |
11076 | |
11077 @item itext_n_addr | |
11078 @cindex itext_n_addr | |
11079 Return a pointer to the beginning of the character offset @var{cc} (in | |
11080 characters) from @var{p}. | |
11081 | |
11082 @example | |
11083 Ibyte *itext_n_addr (Ibyte *p, Charcount cc); | |
11084 @end example | |
11085 @end table | |
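The bookkeeping behind these macros can be illustrated with a small stand-alone sketch. It does not use the XEmacs sources: every name beginning with @code{toy_} or @code{Toy_} is hypothetical, and UTF-8 is assumed as the variable-width encoding purely for illustration (the Mule internal encoding differs). The sketch shows why appending a character advances the pointer by the returned byte count, and why converting between byte counts and character counts means scanning the text.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the XEmacs typedefs; not the real ones. */
typedef unsigned char Toy_Ibyte;
typedef int Toy_Ichar;
typedef ptrdiff_t Toy_Bytecount;
typedef ptrdiff_t Toy_Charcount;

/* Worst-case bytes per character for the assumed UTF-8 encoding. */
#define TOY_MAX_ICHAR_LEN 4

/* Store C at P (UTF-8 assumed) and return the number of bytes written. */
Toy_Bytecount
toy_set_itext_ichar (Toy_Ibyte *p, Toy_Ichar c)
{
  if (c < 0x80)
    {
      p[0] = (Toy_Ibyte) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (Toy_Ibyte) (0xC0 | (c >> 6));
      p[1] = (Toy_Ibyte) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (Toy_Ibyte) (0xE0 | (c >> 12));
      p[1] = (Toy_Ibyte) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (Toy_Ibyte) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (Toy_Ibyte) (0xF0 | (c >> 18));
  p[1] = (Toy_Ibyte) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (Toy_Ibyte) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (Toy_Ibyte) (0x80 | (c & 0x3F));
  return 4;
}

/* Byte length of the character starting at P.  Like INC_IBYTEPTR, this
   assumes P points at the first byte of a valid character. */
Toy_Bytecount
toy_char_len (const Toy_Ibyte *p)
{
  if (*p < 0x80) return 1;
  if (*p < 0xE0) return 2;
  if (*p < 0xF0) return 3;
  return 4;
}

/* Number of characters in the BC bytes starting at P. */
Toy_Charcount
toy_bytecount_to_charcount (const Toy_Ibyte *p, Toy_Bytecount bc)
{
  const Toy_Ibyte *end = p + bc;
  Toy_Charcount cc = 0;

  while (p < end)
    {
      p += toy_char_len (p);	/* the step INC_IBYTEPTR performs */
      cc++;
    }
  return cc;
}

/* Number of bytes occupied by the first CC characters starting at P. */
Toy_Bytecount
toy_charcount_to_bytecount (const Toy_Ibyte *p, Toy_Charcount cc)
{
  const Toy_Ibyte *start = p;

  while (cc-- > 0)
    p += toy_char_len (p);
  return p - start;
}
```

Under these assumptions, appending the characters @samp{a}, U+00E9 and U+20AC to a buffer of @code{3 * TOY_MAX_ICHAR_LEN} bytes writes 1 + 2 + 3 = 6 bytes, and scanning those 6 bytes counts 3 characters.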
11086 | |
11087 @node Conversion to and from External Data, General Guidelines for Writing Mule-Aware Code, Working With Character and Byte Positions, Coding for Mule | |
11088 @subsection Conversion to and from External Data | |
11089 @cindex conversion to and from external data | |
11090 @cindex external data, conversion to and from | |
11091 | |
11092 When an external function, such as a C library function, returns a | |
11093 @code{char} pointer, you should almost never treat it as @code{Ibyte}. | |
11094 This is because these returned strings may contain 8-bit characters which | |
11095 can be misinterpreted by XEmacs, and cause a crash. Likewise, when | |
11096 exporting a piece of internal text to the outside world, you should | |
11097 always convert it to an appropriate external encoding, lest the internal | |
11098 stuff (such as the infamous \201 characters) leak out. | |
11099 | |
11100 The interface for conversion between the internal and external | |
11101 representations of text is the set of conversion macros defined in | |
11102 @file{buffer.h}. There used to be a fixed set of external formats | |
11103 supported by these macros, but now any coding system can be used with | |
11104 them. The coding system alias mechanism is used to create the | |
11105 following logical coding systems, which replace the fixed external | |
11106 formats. The @code{dontusethis-set-symbol-value-handler} mechanism was | |
11107 enhanced to make this possible (more work on that is needed). | |
11108 | |
11109 Often useful coding systems: | |
11110 | |
11111 @table @code | |
11112 @item Qbinary | |
11113 This is the simplest format and is what we use in the absence of a more | |
11114 appropriate format. This converts according to the @code{binary} coding | |
11115 system: | |
11116 | |
11117 @enumerate a | |
11118 @item | |
11119 On input, bytes 0--255 are converted into (implicitly Latin-1) | |
11120 characters 0--255. A non-Mule XEmacs doesn't really know about | |
11121 different character sets and the fonts to display them, so the bytes can | |
11122 be treated as text in different 1-byte encodings by simply setting the | |
11123 appropriate fonts. So in a sense, non-Mule XEmacs is a multi-lingual | |
11124 editor if, for example, different fonts are used to display text in | |
11125 different buffers, faces, or windows. The specifier mechanism gives the | |
11126 user complete control over this kind of behavior. | |
11127 @item | |
11128 On output, characters 0--255 are converted into bytes 0--255 and other | |
11129 characters are converted into @samp{~}. | |
11130 @end enumerate | |
11131 | |
11132 @item Qnative | |
11133 Format used for the external Unix environment---@code{argv[]}, stuff | |
11134 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc. | |
11135 This is encoded according to the encoding specified by the current locale. | |
11136 [[This is dangerous; current locale is user preference, and the system | |
11137 is probably going to be something else. Is there anything we can do | |
11138 about it?]] | |
11139 | |
11140 @item Qfile_name | |
11141 Format used for filenames. This is normally the same as @code{Qnative}, | |
11142 but the two should be distinguished for clarity and possible future | |
11143 separation -- and also because @code{Qfile_name} can be changed using either | |
11144 the @code{file-name-coding-system} or @code{pathname-coding-system} (now | |
11145 obsolete) variables. | |
11146 | |
11147 @item Qctext | |
11148 Compound-text format. This is the standard X11 format used for data | |
11149 stored in properties, selections, and the like. This is an 8-bit | |
11150 no-lock-shift ISO2022 coding system. This is a real coding system, | |
11151 unlike @code{Qfile_name}, which is user-definable. | |
11152 | |
11153 @item Qmswindows_tstr | |
11154 Used for external data in all MS Windows functions that are declared to | |
11155 accept data of type @code{LPTSTR} or @code{LPCTSTR}. This maps to either | |
11156 @code{Qmswindows_multibyte} (a locale-specific encoding, same as | |
11157 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether | |
11158 XEmacs is being run under Windows 9X or Windows NT/2000/XP. | |
11159 @end table | |
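The two conversion rules given above for @code{Qbinary} are simple enough to write out directly. The following is only a sketch of those rules, not the coding-system machinery XEmacs actually uses; the function names are hypothetical, and characters are represented as plain @code{int} codes.

```c
#include <stddef.h>

/* Hypothetical helper: decode LEN external bytes from SRC into character
   codes in DST.  Each byte 0-255 becomes the character with that code. */
void
toy_binary_decode (const unsigned char *src, size_t len, int *dst)
{
  size_t i;

  for (i = 0; i < len; i++)
    dst[i] = src[i];
}

/* Hypothetical helper: encode LEN character codes from SRC into bytes in
   DST.  Characters 0-255 map to the same byte; anything else becomes '~'. */
void
toy_binary_encode (const int *src, size_t len, unsigned char *dst)
{
  size_t i;

  for (i = 0; i < len; i++)
    dst[i] = (src[i] >= 0 && src[i] <= 255) ? (unsigned char) src[i] : '~';
}
```

Note that the output direction loses information: once a character outside the range 0--255 has been replaced by @samp{~}, decoding the result cannot recover it.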
11160 | |
11161 Many other coding systems are provided by default. | |
11162 | |
11163 There are two fundamental macros to convert between external and | |
11164 internal format, as well as various convenience macros to simplify the | |
11165 most common operations. | |
11166 | |
11167 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and | |
11168 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments | |
11169 each of these receives are a source type, a source, a sink type, a sink, | |
11170 and a coding system (or a symbol naming a coding system). | |
11171 | |
11172 A typical call looks like | |
11173 @example | |
11174 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name); | |
11175 @end example | |
11176 | |
11177 which means that the contents of the lisp string @code{str} are written | |
11178 to a malloc'ed memory area which will be pointed to by @code{ptr}, after | |
11179 the function returns. The conversion will be done using the | |
11180 @code{file-name} coding system, which will be controlled by the user | |
11181 indirectly by setting or binding the variable | |
11182 @code{file-name-coding-system}. | |
11183 | |
11184 Some sources and sinks require two C variables to specify. We use some | |
11185 preprocessor magic to allow different source and sink types, and even | |
11186 different numbers of arguments to specify different types of sources and | |
11187 sinks. | |
11188 | |
11189 So we can have a call that looks like | |
11190 @example | |
11191 TO_INTERNAL_FORMAT (DATA, (ptr, len), | |
11192 MALLOC, (ptr, len), | |
11193 coding_system); | |
11194 @end example | |
11195 | |
11196 The parenthesized argument pairs are required to make the preprocessor | |
11197 magic work. | |
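To see why the parentheses matter, here is a minimal sketch of the kind of preprocessor technique involved. It is not the XEmacs implementation; all @code{DEMO_} names are made up for illustration. The type keyword is pasted onto a macro name, and a parenthesized pair survives as a single macro argument until a helper macro takes it apart.

```c
#include <stddef.h>
#include <string.h>

/* Take apart a parenthesized pair: "DEMO_FIRST (a, b)" yields (a). */
#define DEMO_FIRST(a, b) (a)
#define DEMO_SECOND(a, b) (b)

/* Per-type accessors.  For DATA the argument is a pair such as (ptr, len);
   writing the helper name in front of it turns it into a macro call. */
#define DEMO_SRC_PTR_DATA(pair) DEMO_FIRST pair
#define DEMO_SRC_LEN_DATA(pair) DEMO_SECOND pair

/* For C_STRING the argument is a single pointer. */
#define DEMO_SRC_PTR_C_STRING(s) (s)
#define DEMO_SRC_LEN_C_STRING(s) strlen (s)

/* Front end: the type keyword selects the accessor by token pasting.
   Without its parentheses, "ptr, len" would be split into two arguments
   here and the call would not match the macro's parameter list. */
#define DEMO_SRC_PTR(type, src) DEMO_SRC_PTR_##type (src)
#define DEMO_SRC_LEN(type, src) DEMO_SRC_LEN_##type (src)

/* Example use: both source types go through the same front-end macro. */
size_t
demo_total_length (const char *data, size_t data_len, const char *cstr)
{
  return DEMO_SRC_LEN (DATA, (data, data_len))
    + DEMO_SRC_LEN (C_STRING, cstr);
}
```

The same idea extends to sinks: a sink type that needs two variables takes a pair, and one that needs a single variable takes a plain argument.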
11198 | |
11199 Here are the different source and sink types: | |
11200 | |
11201 @table @code | |
11202 @item @code{DATA, (ptr, len),} | |
11203 input data is a fixed buffer of size @var{len} at address @var{ptr} | |
11204 @item @code{ALLOCA, (ptr, len),} | |
11205 output data is placed in an @code{alloca()}ed buffer of size @var{len} pointed to by @var{ptr} | |
11206 @item @code{MALLOC, (ptr, len),} | |
11207 output data is in a @code{malloc()}ed buffer of size @var{len} pointed to by @var{ptr} | |
11208 @item @code{C_STRING_ALLOCA, ptr,} | |
11209 equivalent to @code{ALLOCA, (ptr, len_ignored)} on output | |
11210 @item @code{C_STRING_MALLOC, ptr,} | |
11211 equivalent to @code{MALLOC, (ptr, len_ignored)} on output | |
11212 @item @code{C_STRING, ptr,} | |
11213 equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input | |
11214 @item @code{LISP_STRING, string,} | |
11215 input or output is a Lisp_Object of type string | |
11216 @item @code{LISP_BUFFER, buffer,} | |
11217 output is written to @code{(point)} in lisp buffer @var{buffer} | |
11218 @item @code{LISP_LSTREAM, lstream,} | |
11219 input or output is a Lisp_Object of type lstream | |
11220 @item @code{LISP_OPAQUE, object,} | |
11221 input or output is a Lisp_Object of type opaque | |
11222 @end table | |
11223 | |
11224 A source type of @code{C_STRING} or a sink type of | |
11225 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where | |
11226 the external API is not '\0'-byte-clean -- i.e. it expects strings to be | |
11227 terminated with a null byte. For external APIs that are in fact | |
11228 '\0'-byte-clean, we should of course not use these. | |
11229 | |
11230 The sinks to be specified must be lvalues, unless they are the lisp | |
11231 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}. | |
11232 | |
11233 There is no problem using the same lvalue for source and sink. | |
11234 | |
11235 Garbage collection is inhibited during these conversion operations, so | |
11236 it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}. | |
11237 | |
11238 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the | |
11239 resulting text is stored in a stack-allocated buffer, which is | |
11240 automatically freed on returning from the function. However, the sink | |
11241 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed | |
11242 memory. The caller is responsible for freeing this memory using | |
11243 @code{xfree()}. | |
11244 | |
11245 Note that it doesn't make sense for @code{LISP_STRING} to be a source | |
11246 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}. | |
11247 You'll get an assertion failure if you try. | |
11248 | |
11249 99% of conversions involve raw data or Lisp strings as both source and | |
11250 sink, and the output usually goes into @code{alloca()}ed, or sometimes | |
11251 @code{xmalloc()}ed, memory. For this reason, convenience macros are defined for | |
11252 many types of conversions involving raw data and/or Lisp strings, | |
11253 especially when the output is an @code{alloca()}ed string. (When the | |
11254 destination is a Lisp string, there are other functions that should be | |
11255 used instead -- @code{build_ext_string()} and @code{make_ext_string()}, | |
11256 for example.) The convenience macros are of two types -- the older kind | |
11257 that store the result into a specified variable, and the newer kind that | |
11258 return the result. The newer kind of macros don't exist when the output | |
11259 is sized data, because that would have two return values. NOTE: All | |
11260 convenience macros are ultimately defined in terms of | |
11261 @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}. Thus, any | |
11262 comments above about the workings of these macros also apply to all | |
11263 convenience macros. | |
11264 | |
11265 A typical old-style convenience macro is | |
11266 | |
11267 @example | |
11268 C_STRING_TO_EXTERNAL (in, out, codesys); | |
11269 @end example | |
11270 | |
11271 This is equivalent to | |
11272 | |
11273 @example | |
11274 TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys); | |
11275 @end example | |
11276 | |
11277 but is easier to write and somewhat clearer, since it clearly identifies | |
11278 the arguments without the clutter of having the preprocessor types mixed | |
11279 in. | |
11280 | |
11281 The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src, | |
11282 codesys)}, which @emph{returns} the converted data (still in | |
11283 @code{alloca()} space). This is far more convenient for most | |
11284 operations. | |
11285 | |
11286 @node General Guidelines for Writing Mule-Aware Code, An Example of Mule-Aware Code, Conversion to and from External Data, Coding for Mule | |
11287 @subsection General Guidelines for Writing Mule-Aware Code | |
11288 @cindex writing Mule-aware code, general guidelines for | |
11289 @cindex Mule-aware code, general guidelines for writing | |
11290 @cindex code, general guidelines for writing Mule-aware | |
11291 | |
11292 This section contains some general guidance on how to write Mule-aware | |
11293 code, as well as some pitfalls you should avoid. | |
11294 | |
11295 @table @emph | |
11296 @item Never use @code{char} and @code{char *}. | |
11297 In XEmacs, the use of @code{char} and @code{char *} is almost always a | |
11298 mistake. If you want to manipulate an Emacs character from ``C'', use | |
11299 @code{Ichar}. If you want to examine a specific octet in the internal | |
11300 format, use @code{Ibyte}. If you want a Lisp-visible character, use a | |
11301 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move | |
11302 through the internal text, use @code{Ibyte *}. Also note that you | |
11303 almost certainly do not need @code{Ichar *}. Other typedefs to clarify | |
11304 the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary}, | |
11305 @code{UChar_Binary}, and @code{CIbyte}. | |
11306 | |
11307 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}. | |
11308 The whole point of using different types is to avoid confusion about the | |
11309 use of certain variables. Lest this effect be nullified, you need to be | |
11310 careful about using the right types. | |
11311 | |
11312 @item Always convert external data | |
11313 It is extremely important to always convert external data, because | |
11314 XEmacs can crash if unexpected 8-bit sequences are copied to its internal | |
11315 buffers literally. | |
11316 | |
11317 This means that when a system function, such as @code{readdir}, returns | |
11318 a string, you normally need to convert it using one of the conversion macros | |
11319 described in the previous section, before passing it further to Lisp. | |
11320 | |
11321 Actually, most of the basic system functions that accept '\0'-terminated | |
11322 string arguments, like @code{stat()} and @code{open()}, have | |
11323 @strong{encapsulated} equivalents that do the internal to external | |
11324 conversion themselves. The encapsulated equivalents have a @code{qxe_} | |
11325 prefix and have string arguments of type @code{Ibyte *}, and you can | |
11326 pass internally encoded data to them, often from a Lisp string using | |
11327 @code{XSTRING_DATA}. (A better design might be to provide versions that | |
11328 accept Lisp strings directly.) [[Really? Then they'd either take | |
11329 @code{Lisp_Object}s and need to check type, or they'd take | |
11330 @code{Lisp_String}s, and violate the rules about passing any of the | |
11331 specific Lisp types.]] | |
11332 | |
11333 Also note that many internal functions, such as @code{make_string}, | |
11334 accept Ibytes, which removes the need for them to convert the data they | |
11335 receive. This increases efficiency because that way external data needs | |
11336 to be decoded only once, when it is read. After that, it is passed | |
11337 around in internal format. | |
11338 | |
11339 @item Do all work in internal format | |
11340 External-formatted data is completely unpredictable in its format. It | |
11341 may be fixed-width Unicode (not even ASCII compatible); it may be a | |
11342 modal encoding, in | |
11343 which case some occurrences of (e.g.) the slash character may be part of | |
11344 two-byte Asian-language characters, and a naive attempt to split apart a | |
11345 pathname by slashes will fail; etc. Internal-format text should be | |
11346 converted to external format only at the point where an external API is | |
11347 actually called, and the first thing done after receiving | |
11348 external-format text from an external API should be to convert it to | |
11349 internal text. | |
11350 @end table | |
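The danger of working on external text can be made concrete with ISO-2022-JP, a modal encoding in which the bytes of two-byte characters fall in the ASCII range. The sketch below is stand-alone and hypothetical (neither function exists in XEmacs). Its state tracking is deliberately simplified: it recognizes only the escape sequences @code{ESC $ @@}, @code{ESC $ B}, @code{ESC ( B} and @code{ESC ( J}.

```c
#include <stddef.h>

/* Count bytes equal to '/' (0x2F) with no regard for the encoding. */
size_t
naive_count_slashes (const unsigned char *s, size_t len)
{
  size_t i, n = 0;

  for (i = 0; i < len; i++)
    if (s[i] == '/')
      n++;
  return n;
}

/* Count real slash characters in ISO-2022-JP text by tracking whether
   the current mode is one-byte or two-byte.  Simplified: only four
   escape sequences are recognized. */
size_t
iso2022jp_count_slashes (const unsigned char *s, size_t len)
{
  size_t i = 0, n = 0;
  int two_byte = 0;

  while (i < len)
    {
      if (s[i] == 0x1B && i + 2 < len)
	{
	  if (s[i + 1] == '$' && (s[i + 2] == '@' || s[i + 2] == 'B'))
	    {
	      two_byte = 1;	/* switch to a two-byte character set */
	      i += 3;
	      continue;
	    }
	  if (s[i + 1] == '(' && (s[i + 2] == 'B' || s[i + 2] == 'J'))
	    {
	      two_byte = 0;	/* switch back to a one-byte set */
	      i += 3;
	      continue;
	    }
	}
      if (two_byte)
	{
	  i += 2;		/* skip a whole two-byte character */
	  continue;
	}
      if (s[i] == '/')
	n++;
      i++;
    }
  return n;
}
```

For the byte sequence @code{a / ESC $ B 0x25 0x2F ESC ( B / b}, the naive scan reports three slashes, while the mode-aware scan reports two: the byte 0x2F in the middle is the second half of a two-byte character. Code that splits a pathname on that byte would cut a character in half, which is why such work should be done on internal-format text instead.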
11351 | |
11352 @node An Example of Mule-Aware Code, Mule-izing Code, General Guidelines for Writing Mule-Aware Code, Coding for Mule | |
11353 @subsection An Example of Mule-Aware Code | |
11354 @cindex code, an example of Mule-aware | |
11355 @cindex Mule-aware code, an example of | |
11356 | |
11357 As an example of Mule-aware code, we will analyze the @code{string} | |
11358 function, which conses up a Lisp string from the character arguments it | |
11359 receives. Here is the definition, pasted from @file{alloc.c}: | |
11360 | |
11361 @example | |
11362 @group | |
11363 DEFUN ("string", Fstring, 0, MANY, 0, /* | |
11364 Concatenate all the argument characters and make the result a string. | |
11365 */ | |
11366        (int nargs, Lisp_Object *args)) | |
11367 @{ | |
11368   Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN); | |
11369   Ibyte *p = storage; | |
11370 | |
11371   for (; nargs; nargs--, args++) | |
11372     @{ | |
11373       Lisp_Object lisp_char = *args; | |
11374       CHECK_CHAR_COERCE_INT (lisp_char); | |
11375       p += set_itext_ichar (p, XCHAR (lisp_char)); | |
11376     @} | |
11377   return make_string (storage, p - storage); | |
11378 @} | |
11379 @end group | |
11380 @end example | |
11381 | |
11382 Now we can analyze the source line by line. | |
11383 | |
11384 The resulting string contains exactly as many characters as there are | |
11385 arguments to the function. This is why we allocate @code{MAX_ICHAR_LEN} * @var{nargs} | |
11386 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs} | |
11387 @code{Ichar}s to fit in the string. | |
11388 | |
11389 Then, the loop checks that each element is a character, converting | |
11390 integers in the process. Like many other functions in XEmacs, this | |
11391 function silently accepts integers where characters are expected, for | |
11392 historical and compatibility reasons. In new code that has no need to | |
11393 accept integers, @code{CHECK_CHAR} suffices. @code{XCHAR (lisp_char)} | |
11394 extracts the @code{Ichar} from the @code{Lisp_Object}, and | |
11395 @code{set_itext_ichar} stores it into @code{storage}, advancing @code{p} in | |
11396 the process. Finally, @code{make_string} builds the Lisp string from the @code{p - storage} bytes actually written. | |
11397 | |
11398 Other instructive examples of correct coding under Mule can be found all | |
11399 over the XEmacs code. For starters, I recommend | |
11400 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have | |
11401 understood this section of the manual and studied the examples, you can | |
11402 proceed to write new Mule-aware code. | |
11403 | |
11404 @node Mule-izing Code, , An Example of Mule-Aware Code, Coding for Mule | |
11405 @subsection Mule-izing Code | |
11406 | |
11407 A lot of code is written without Mule in mind, and needs to be made | |
11408 Mule-correct or "Mule-ized". There is really no substitute for | |
11409 line-by-line analysis when doing this, but the following checklist can | |
11410 help: | |
11411 | |
11412 @itemize @bullet | |
11413 @item | |
11414 Check all uses of @code{XSTRING_DATA}. | |
11415 @item | |
11416 Check all uses of @code{build_string} and @code{make_string}. | |
11417 @item | |
11418 Check all uses of @code{tolower} and @code{toupper}. | |
11419 @item | |
11420 Check object print methods. | |
11421 @item | |
11422 Check for use of functions such as @code{write_c_string}, | |
11423 @code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}. | |
11424 @item | |
11425 Check all occurrences of @code{char} and correct to one of the other | |
11426 typedefs described above. | |
11427 @item | |
11428 Check all existing uses of @code{TO_EXTERNAL_FORMAT}, | |
11429 @code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for | |
11430 @samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}). | |
11431 @item | |
11432 In Windows code, string literals may need to be encapsulated with @code{XETEXT}. | |
11433 @end itemize | |
11434 | |
11435 @node CCL, Microsoft Windows-Related Multilingual Issues, Coding for Mule, Multilingual Support | |
11436 @section CCL | |
11437 @cindex CCL | |
11438 | |
11439 @example | |
11440 MACHINE CODE: | |
11441 | |
11442 The machine code consists of a vector of 32-bit words. | |
11443 The first such word specifies the start of the EOF section of the code; | |
11444 this is the code executed to handle any stuff that needs to be done | |
11445 (e.g. designating back to ASCII and left-to-right mode) after all | |
11446 other encoded/decoded data has been written out. This is not used for | |
11447 charset CCL programs. | |
11448 | |
11449 REGISTER: 0..7 -- referred by RRR or rrr | |
11450 | |
11451 OPERATOR BIT FIELD (27-bit): XXXXXXXXXXXXXXX RRR TTTTT | |
11452 TTTTT (5-bit): operator type | |
11453 RRR (3-bit): register number | |
11454 XXXXXXXXXXXXXXX (15-bit): | |
11455 CCCCCCCCCCCCCCC: constant or address | |
11456 000000000000rrr: register number | |
11457 | |
11458 AAAAA: 00000 + | |
11459 00001 - | |
11460 00010 * | |
11461 00011 / | |
11462 00100 % | |
11463 00101 & | |
11464 00110 | | |
11465 00111 ~ | |
11466 | |
11467 01000 << | |
11468 01001 >> | |
11469 01010 <8 | |
11470 01011 >8 | |
11471 01100 // | |
11472 01101 not used | |
11473 01110 not used | |
11474 01111 not used | |
11475 | |
11476 10000 < | |
11477 10001 > | |
11478 10010 == | |
11479 10011 <= | |
11480 10100 >= | |
11481 10101 != | |
11482 | |
11483 OPERATORS: TTTTT RRR XX.. | |
11484 | |
11485 SetCS: 00000 RRR C...C RRR = C...C | |
11486 SetCL: 00001 RRR ..... RRR = c...c | |
11487 c.............c | |
11488 SetR: 00010 RRR ..rrr RRR = rrr | |
11489 SetA: 00011 RRR ..rrr RRR = array[rrr] | |
11490 C.............C size of array = C...C | |
11491 c.............c contents = c...c | |
11492 | |
11493 Jump: 00100 000 c...c jump to c...c | |
11494 JumpCond: 00101 RRR c...c if (!RRR) jump to c...c | |
11495 WriteJump: 00110 RRR c...c Write1 RRR, jump to c...c | |
11496 WriteReadJump: 00111 RRR c...c Write1, Read1 RRR, jump to c...c | |
11497 WriteCJump: 01000 000 c...c Write1 C...C, jump to c...c | |
11498 C...C | |
11499 WriteCReadJump: 01001 RRR c...c Write1 C...C, Read1 RRR, | |
11500 C.............C and jump to c...c | |
11501 WriteSJump: 01010 000 c...c WriteS, jump to c...c | |
11502 C.............C | |
11503 S.............S | |
11504 ... | |
11505 WriteSReadJump: 01011 RRR c...c WriteS, Read1 RRR, jump to c...c | |
11506 C.............C | |
11507 S.............S | |
11508 ... | |
11509 WriteAReadJump: 01100 RRR c...c WriteA, Read1 RRR, jump to c...c | |
11510 C.............C size of array = C...C | |
11511 c.............c contents = c...c | |
11512 ... | |
11513 Branch: 01101 RRR C...C if (RRR >= 0 && RRR < C..) | |
11514 c.............c branch to (RRR+1)th address | |
11515 Read1: 01110 RRR ... read 1-byte to RRR | |
11516 Read2: 01111 RRR ..rrr read 2-byte to RRR and rrr | |
11517 ReadBranch: 10000 RRR C...C Read1 and Branch | |
11518 c.............c | |
11519 ... | |
11520 Write1: 10001 RRR ..... write 1-byte RRR | |
11521 Write2: 10010 RRR ..rrr write 2-byte RRR and rrr | |
11522 WriteC: 10011 000 ..... write 1-char C...CC | |
11523 C.............C | |
11524 WriteS: 10100 000 ..... write C..-byte of string | |
11525 C.............C | |
11526 S.............S | |
11527 ... | |
11528 WriteA: 10101 RRR ..... write array[RRR] | |
11529 C.............C size of array = C...C | |
11530 c.............c contents = c...c | |
11531 ... | |
11532 End: 10110 000 ..... terminate the execution | |
11533 | |
11534 SetSelfCS: 10111 RRR C...C RRR AAAAA= C...C | |
11535 ..........AAAAA | |
11536 SetSelfCL: 11000 RRR ..... RRR AAAAA= c...c | |
11537 c.............c | |
11538 ..........AAAAA | |
11539 SetSelfR: 11001 RRR ..Rrr RRR AAAAA= rrr | |
11540 ..........AAAAA | |
11541 SetExprCL: 11010 RRR ..Rrr RRR = rrr AAAAA c...c | |
11542 c.............c | |
11543 ..........AAAAA | |
11544 SetExprR: 11011 RRR ..rrr RRR = rrr AAAAA Rrr | |
11545 ............Rrr | |
11546 ..........AAAAA | |
11547 JumpCondC: 11100 RRR c...c if !(RRR AAAAA C..) jump to c...c | |
11548 C.............C | |
11549 ..........AAAAA | |
11550 JumpCondR: 11101 RRR c...c if !(RRR AAAAA rrr) jump to c...c | |
11551 ............rrr | |
11552 ..........AAAAA | |
11553 ReadJumpCondC: 11110 RRR c...c Read1 and JumpCondC | |
11554 C.............C | |
11555 ..........AAAAA | |
11556 ReadJumpCondR: 11111 RRR c...c Read1 and JumpCondR | |
11557 ............rrr | |
11558 ..........AAAAA | |
11559 @end example | |
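As a reading aid for the table above, the sketch below packs and unpacks the first word of an instruction. It assumes that the layout @samp{XXXXXXXXXXXXXXX RRR TTTTT} is written with the most significant bits on the left, so that the operator type occupies the low 5 bits, the register number the next 3, and the constant or address the following 15. The names are hypothetical, and the actual CCL interpreter should be consulted for the authoritative layout.

```c
#include <stdint.h>

/* Fields of one instruction word, as named in the table above. */
struct toy_ccl_insn
{
  unsigned op;			/* TTTTT: operator type */
  unsigned reg;			/* RRR: register number 0..7 */
  unsigned arg;			/* 15-bit constant, address or rrr */
};

/* Pack the three fields into a word, assuming TTTTT is in the low bits. */
uint32_t
toy_ccl_encode (unsigned op, unsigned reg, unsigned arg)
{
  return ((uint32_t) (arg & 0x7FFF) << 8)
    | ((uint32_t) (reg & 0x7) << 5)
    | (uint32_t) (op & 0x1F);
}

/* Split a word back into its fields under the same assumption. */
struct toy_ccl_insn
toy_ccl_decode (uint32_t word)
{
  struct toy_ccl_insn insn;

  insn.op = word & 0x1F;
  insn.reg = (word >> 5) & 0x7;
  insn.arg = (word >> 8) & 0x7FFF;
  return insn;
}
```

For instance, a @code{JumpCond} (operator type 00101) on register 3 with address 300 would be packed as @code{(300 << 8) | (3 << 5) | 5} under this reading.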
11560 | |
11561 @node Microsoft Windows-Related Multilingual Issues, Modules for Internationalization, CCL, Multilingual Support | |
11562 @section Microsoft Windows-Related Multilingual Issues | |
11563 @cindex Microsoft Windows-related multilingual issues | |
11564 @cindex Windows-related multilingual issues | |
11565 @cindex multilingual issues, Windows-related | |
11566 | |
11567 @menu | |
11568 * Microsoft Documentation:: | |
11569 * Locales:: | |
11570 * More about code pages:: | |
11571 * More about locales:: | |
11572 * Unicode support under Windows:: | |
11573 * The golden rules of writing Unicode-safe code:: | |
11574 * The format of the locale in setlocale():: | |
11575 * Random other Windows I18N docs:: | |
11576 @end menu | |
11577 | |
11578 @node Microsoft Documentation, Locales, Microsoft Windows-Related Multilingual Issues, Microsoft Windows-Related Multilingual Issues | |
11579 @subsection Microsoft Documentation | |
11580 @cindex Microsoft documentation | |
11581 | |
11582 Documentation on international support in Windows is scattered throughout MSDN. | |
11583 Here are some good places to look: | |
11584 | |
11585 @enumerate | |
11586 @item | |
11587 C Runtime (CRT) intl support | |
11588 | |
11589 @enumerate | |
11590 @item | |
11591 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Internationalization | |
11592 @item | |
11593 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Global Constants -> Locale Categories | |
11594 @item | |
11595 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Appendixes -> Language and Country/Region Strings | |
11596 @item | |
11597 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Appendixes -> Generic-Text Mappings | |
11598 @item | |
11599 Function documentation for various functions: | |
11600 Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Alphabetic Function Reference | |
11601 e.g. _setmbcp(), setlocale(), strcoll functions | |
11602 @end enumerate | |
11603 | |
11604 @item | |
11605 Win32 API intl support | |
11606 | |
11607 @enumerate | |
11608 @item | |
11609 Platform SDK Documentation -> Base Services -> International Features | |
11610 @item | |
11611 Platform SDK Documentation -> User Interface Services -> Windows User Interface -> User Input -> Keyboard Input -> Character Messages -> International Features | |
11612 @item | |
11613 Backgrounders -> Windows Platform -> Windows 2000 -> International Support in Microsoft Windows 2000 | |
11614 @end enumerate | |
11615 | |
11616 @item | |
11617 Microsoft Layer for Unicode | |
11618 | |
11619 Platform SDK Documentation -> Windows API -> Windows 95/98/Me Programming -> Windows 95/98/Me Overviews -> Microsoft Layer for Unicode on Windows 95/98/Me Systems | |
11620 | |
11621 @item | |
11622 Look in the CRT sources! They come with VC++. See win32.c. | |
11623 @end enumerate | |
11624 | |
11625 @node Locales, More about code pages, Microsoft Documentation, Microsoft Windows-Related Multilingual Issues | |
11626 @subsection Locales, code pages, and other concepts of "language" | |
11627 @cindex locales, code pages, and other concepts of "language" | |
11628 | |
11629 First, make sure you clearly understand the difference between the C | |
11630 runtime library (CRT) and the Win32 API! See win32.c. | |
11631 | |
11632 There are various different ways of representing the vague concept | |
11633 of "language", and it can be very confusing. So: | |
11634 | |
11635 @itemize @bullet | |
11636 @item | |
11637 The CRT library has the concept of "locale", which is a | |
11638 combination of language and country, and which controls the way | |
11639 currency and dates are displayed, the encoding of data, etc. | |
11640 | |
11641 @item | |
11642 XEmacs has the concept of "language environment", more or less | |
11643 like a locale; although currently in most cases it just refers to | |
11644 the language, and no sub-language distinctions are | |
11645 made. (An exception is Chinese, which has different language | |
11646 environments for Taiwan and mainland China, due to the different | |
11647 encodings and writing systems.) | |
11648 | |
11649 @item | |
11650 Windows has a number of different language concepts: | |
11651 | |
11652 @enumerate | |
11653 @item | |
11654 There are "languages" and "sublanguages", which correspond to | |
11655 the languages and countries of the C library -- e.g. LANG_ENGLISH | |
11656 and SUBLANG_ENGLISH_US. These are identified by small integers (10 | |
11657 and 6 bits wide), called the "primary language identifier" and | |
11658 "sublanguage identifier", respectively. These are combined into a | |
11659 16-bit integer or "language identifier" by MAKELANGID(). | |
11660 | |
11661 @item | |
11662 The language identifier in turn is combined with a "sort | |
11663 identifier" (and optionally a "sort version") to yield a 32-bit | |
11664 integer called a "locale identifier" (type LCID), which identifies | |
11665 locales -- the primary means of distinguishing language/regional | |
11666 settings and similar to C library locales. | |
11667 | |
11668 @item | |
11669 A "code page" combines the XEmacs concepts of "charset" and "coding | |
11670 system". It logically encompasses | |
11671 | |
11672 @itemize @minus | |
11673 @item | |
11674 a set of supported characters | |
11675 @item | |
11676 an enumeration associating each character with a code point, which | |
11677 is a number or number pair; there may be disjoint ranges of numbers | |
11678 supported | |
11679 @item | |
11680 a way of encoding a series of characters into a string of bytes | |
11681 @end itemize | |
11682 | |
11683 Note that the first two properties correspond to an XEmacs "charset" | |
11684 and the last to an XEmacs "coding system". | |
11685 | |
11686 Traditional encodings are either simple one-byte encodings, or | |
11687 combination one-byte/two-byte encodings (aka MBCS encodings, where MBCS | |
11688 stands for "Multibyte Character Set") with the following properties: | |
11689 | |
11690 @itemize @minus | |
11691 @item | |
11692 all characters are encoded as a one-byte or two-byte sequence | |
11693 @item | |
11694 the encoding is stateless (non-modal) | |
11695 @item | |
11696 the lower 128 bytes are compatible with ASCII | |
11697 @item | |
11698 in the higher bytes, the value of the first byte ("lead byte") | |
11699 determines whether a second byte follows | |
11700 @item | |
11701 the values used for second bytes may overlap those used for first | |
11702 bytes, and (in some encodings) include values in the low half; thus, | |
11703 moving backwards is hard, and pure-ASCII algorithms (e.g. finding the | |
11704 next slash) will fail unless rewritten to be MBCS-aware (neither of | |
11705 these problems exists in UTF-8 or in the XEmacs internal string | |
11706 encoding) | |
11707 @end itemize | |
11708 | |
11709 Recent code pages, however, do not necessarily follow these properties -- | |
11710 code pages have been expanded to include arbitrary encodings, such as | |
11711 UTF-8 (may have more than two bytes per character) and ISO-2022-JP | |
11712 (complex modal encoding). | |
11713 | |
11714 @item | |
11715 Every Windows locale has four associated code pages: ANSI (an | |
11716 international standard or some Microsoft-created approximation; the | |
11717 native code page under Windows), OEM (a DOS encoding, still used in the | |
11718 FAT file system), Mac (an encoding used on the Macintosh) and EBCDIC (a | |
11719 non-ASCII-compatible encoding used on IBM mainframes, originally based | |
11720 on the BCD or "binary-coded decimal" encoding of numbers). All code | |
11721 pages associated with a locale follow (as far as I know) the properties | |
11722 listed above for traditional code pages. More than one locale can share | |
11723 a code page -- e.g. all the Western European languages, including | |
11724 English, do. | |
11725 | |
11726 @item | |
11727 Windows also has an "input locale identifier" (aka "keyboard | |
11728 layout id") or HKL, which is a 32-bit integer composed of the | |
11729 16-bit language identifier and a 16-bit "device identifier", which | |
11730 originally specified a particular keyboard layout (e.g. the locale | |
11731 "US English" can have the QWERTY layout, the Dvorak layout, etc.), | |
11732 but has been expanded to include speech-to-text converters and | |
11733 other non-keyboard ways of inputting text. Note that both the HKL | |
11734 and LCID share the language identifier in the lower 16 bits, and in | |
11735 both cases a 0 in the upper 16 bits means "default" (sort order or | |
11736 device), providing a way to convert between HKL's, LCID's, and | |
11737 language identifiers (i.e. language/sublanguage pairs). The | |
11738 default keyboard layout for a language is (as far as I can | |
11739 determine) established using the Regional Settings control panel | |
11740 applet, where you can add input locales as combinations of language | |
11741 (actually language/sublanguage) and layout; presumably if you list | |
11742 only one input locale with a particular language, the corresponding | |
11743 layout is the default for that language. But what if you list more | |
11744 than one? You can specify a single default input locale, but there | |
11745 appears to be no way to do so on a per-language basis. | |
11746 @end enumerate | |
11747 @end itemize | |
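
As a rough illustration of how the identifiers above nest, here is a
portable sketch of the bit packing. The sketch_* names are made up for
this example; they stand in for the MAKELANGID(), MAKELCID(),
PRIMARYLANGID() and SUBLANGID() macros in the Windows headers. The layout
assumed is the one those macros use: primary language in the low 10 bits
of the language identifier and sublanguage in the upper 6, with the
language identifier in the low 16 bits of the LCID and the sort
identifier above it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MAKELANGID(): sublanguage in the upper 6
   bits, primary language in the lower 10 bits. */
uint16_t sketch_make_langid(unsigned primary, unsigned sub)
{
    assert(primary < (1u << 10) && sub < (1u << 6));
    return (uint16_t) ((sub << 10) | primary);
}

/* Hypothetical stand-in for MAKELCID(): sort identifier above the
   16-bit language identifier. */
uint32_t sketch_make_lcid(uint16_t langid, unsigned sort_id)
{
    return ((uint32_t) sort_id << 16) | langid;
}

/* The language identifier is the low 16 bits of an LCID. */
uint16_t sketch_langid_from_lcid(uint32_t lcid)
{
    return (uint16_t) (lcid & 0xFFFFu);
}

/* Split a language identifier back into its two parts. */
unsigned sketch_primary_lang(uint16_t langid)
{
    return langid & 0x3FFu;
}

unsigned sketch_sub_lang(uint16_t langid)
{
    return langid >> 10;
}
```

With LANG_ENGLISH (0x09) and SUBLANG_ENGLISH_US (0x01),
sketch_make_langid() yields 0x0409. With a sort identifier of 0
("default") the LCID has the same numeric value, which is what makes
converting between language identifiers and LCIDs so cheap.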
11748 | |
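
The remark above about pure-ASCII algorithms failing on MBCS text can be
made concrete with a small sketch. It assumes the lead-byte ranges of
code page 932 (Shift-JIS), namely 0x81-0x9F and 0xE0-0xFC; the function
names are invented for this example and are not XEmacs or Windows
routines.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte test for code page 932 (assumed ranges, see above). */
int sketch_is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Naive search: looks at every byte, so it can match a trail byte. */
ptrdiff_t sketch_find_byte_naive(const unsigned char *s, unsigned char target)
{
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++)
        if (s[i] == target)
            return (ptrdiff_t) i;
    return -1;
}

/* MBCS-aware search: steps over two-byte characters as a unit, so only
   single-byte characters are ever compared with the target. */
ptrdiff_t sketch_find_byte_mbcs(const unsigned char *s, unsigned char target)
{
    size_t i = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        if (sketch_is_cp932_lead_byte(s[i]) && s[i + 1] != '\0') {
            i += 2;             /* skip lead byte and trail byte */
            continue;
        }
        if (s[i] == target)
            return (ptrdiff_t) i;
        i++;
    }
    return -1;
}
```

For the byte sequence 0x95 0x5C 0x5C (a two-byte character whose trail
byte happens to be 0x5C, followed by a real backslash), the naive search
reports offset 1 while the MBCS-aware one reports offset 2.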
11749 @node More about code pages, More about locales, Locales, Microsoft Windows-Related Multilingual Issues | |
11750 @subsection More about code pages | |
11751 @cindex more about code pages | |
11752 | |
11753 Here is what MSDN says about code pages (article "Code Pages"): | |
11754 | |
11755 @quotation | |
11756 A code page is a character set, which can include numbers, | |
11757 punctuation marks, and other glyphs. Different languages and locales | |
11758 may use different code pages. For example, ANSI code page 1252 is | |
11759 used for American English and most European languages; OEM code page | |
11760 932 is used for Japanese Kanji. | |
11761 | |
11762 A code page can be represented in a table as a mapping of characters | |
11763 to single-byte values or multibyte values. Many code pages share the | |
11764 ASCII character set for characters in the range 0x00 -- 0x7F. | |
11765 | |
11766 The Microsoft run-time library uses the following types of code pages: | |
11767 | |
11768 -- System-default ANSI code page. By default, at startup the run-time | |
11769 system automatically sets the multibyte code page to the | |
11770 system-default ANSI code page, which is obtained from the operating | |
11771 system. The call | |
11772 | |
11773 setlocale ( LC_ALL, "" ); | |
11774 | |
11775 also sets the locale to the system-default ANSI code page. | |
11776 | |
11777 -- Locale code page. The behavior of a number of run-time routines is | |
11778 dependent on the current locale setting, which includes the locale | |
11779 code page. (For more information, see Locale-Dependent Routines.) By | |
11780 default, all locale-dependent routines in the Microsoft run-time | |
11781 library use the code page that corresponds to the "C" locale. At | |
11782 run-time you can change or query the locale code page in use with a | |
11783 call to setlocale. | |
11784 | |
11785 -- Multibyte code page. The behavior of most of the multibyte-character | |
11786 routines in the run-time library depends on the current multibyte | |
11787 code page setting. By default, these routines use the system-default | |
11788 ANSI code page. At run-time you can query and change the multibyte | |
11789 code page with _getmbcp and _setmbcp, respectively. | |
11790 | |
11791 -- The "C" locale is defined by ANSI to correspond to the locale in | |
11792 which C programs have traditionally executed. The code page for the | |
11793 "C" locale ("C" code page) corresponds to the ASCII character | |
11794 set. For example, in the "C" locale, islower returns true for the | |
11795 values 0x61 -- 0x7A only. In another locale, islower may return true | |
11796 for these as well as other values, as defined by that locale. | |
11797 | |
11798 Under "Locale-Dependent Routines" we notice the following setlocale | |
11799 dependencies: | |
11800 | |
11801 atof, atoi, atol (LC_NUMERIC) | |
11802 is Routines (LC_CTYPE) | |
11803 isleadbyte (LC_CTYPE) | |
11804 localeconv (LC_MONETARY, LC_NUMERIC) | |
11805 MB_CUR_MAX (LC_CTYPE) | |
11806 _mbccpy (LC_CTYPE) | |
11807 _mbclen (LC_CTYPE) | |
11808 mblen (LC_CTYPE ) | |
11809 _mbstrlen (LC_CTYPE) | |
11810 mbstowcs (LC_CTYPE) | |
11811 mbtowc (LC_CTYPE) | |
11812 printf (LC_NUMERIC, for radix character output) | |
11813 scanf (LC_NUMERIC, for radix character recognition) | |
11814 setlocale/_wsetlocale (Not applicable) | |
11815 strcoll (LC_COLLATE) | |
11816 _stricoll/_wcsicoll (LC_COLLATE) | |
11817 _strncoll/_wcsncoll (LC_COLLATE) | |
11818 _strnicoll/_wcsnicoll (LC_COLLATE) | |
11819 strftime, wcsftime (LC_TIME) | |
11820 _strlwr (LC_CTYPE) | |
11821 strtod/wcstod/strtol/wcstol/strtoul/wcstoul (LC_NUMERIC, for radix character recognition) | |
11822 _strupr (LC_CTYPE) | |
11823 strxfrm/wcsxfrm (LC_COLLATE) | |
11824 tolower/towlower (LC_CTYPE) | |
11825 toupper/towupper (LC_CTYPE) | |
11826 wcstombs (LC_CTYPE) | |
11827 wctomb (LC_CTYPE) | |
11828 _wtoi/_wtol (LC_NUMERIC) | |
11829 @end quotation | |
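
The claim in the quoted article about islower in the "C" locale is easy
to check with portable C. The following is only a sketch; the helper
names are invented here, and note that setlocale() changes process-wide
state, so real code would save and restore the previous locale.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Count the byte values for which islower() is true in the "C" locale. */
int sketch_count_lowercase_in_c_locale(void)
{
    int count = 0;
    const char *result = setlocale(LC_ALL, "C");

    assert(result != NULL);     /* the "C" locale always exists */
    (void) result;
    for (int c = 0; c <= 0xFF; c++)
        if (islower(c))
            count++;
    return count;
}

/* Classify a single byte value under the "C" locale. */
int sketch_is_lowercase_in_c_locale(int c)
{
    setlocale(LC_ALL, "C");
    return islower(c) != 0;
}
```

Only the 26 values 0x61 through 0x7A qualify; a value such as 0xE9,
which is a lowercase letter in code page 1252, does not.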
11830 | |
11831 NOTE: The above documentation doesn't clearly explain the "locale code | |
11832 page" and "multibyte code page". These are two different values, | |
11833 maintained respectively in the CRT global variables __lc_codepage and | |
11834 __mbcodepage. Calling e.g. setlocale (LC_ALL, "JAPANESE") sets @strong{ONLY} | |
11835 __lc_codepage to 932 (the code page for Japanese), and leaves | |
11836 __mbcodepage unchanged (usually 1252, i.e. Windows-ANSI). You'd have to | |
11837 call _setmbcp() to change __mbcodepage. Figuring out from the | |
11838 documentation which routines use which code page is not so obvious. But: | |
11839 | |
11840 @itemize @bullet | |
11841 @item | |
11842 from "Interpretation of Multibyte-Character Sequences" it appears that | |
11843 all "multibyte-character routines" use the multibyte code page except for | |
11844 mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), and wctomb(). | |
11845 | |
11846 @item | |
11847 from "_setmbcp": "The multibyte code page also affects | |
11848 multibyte-character processing by the following run-time library | |
11849 routines: _exec functions _mktemp _stat _fullpath _spawn functions | |
11850 _tempnam _makepath _splitpath tmpnam. In addition, all run-time library | |
11851 routines that receive multibyte-character argv or envp program arguments | |
11852 as parameters (such as the _exec and _spawn families) process these | |
11853 strings according to the multibyte code page. Hence these routines are | |
11854 also affected by a call to _setmbcp that changes the multibyte code | |
11855 page." | |
11856 @end itemize | |
11857 | |
11858 Summary: from looking at the CRT source (which comes with VC++) and | |
11859 carefully looking through the docs, it appears that: | |
11860 | |
11861 @itemize @bullet | |
11862 @item | |
11863 the "locale code page" is used by all of the routines listed above | |
11864 under "Locale-Dependent Routines" (EXCEPT _mbccpy() and _mbclen()), | |
11865 as well as any other place that converts between multibyte and Unicode | |
11866 strings, e.g. the startup code. | |
11867 @item | |
11868 the "multibyte code page" is used in all of the *mb*() routines | |
11869 except mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), | |
11870 and wctomb(); also _exec*(), _spawn*(), _mktemp(), _stat(), _fullpath(), | |
11871 _tempnam(), _makepath(), _splitpath(), tmpnam(), and similar functions | |
11872 without the leading underscore. | |
11873 @end itemize | |
11874 | |
11875 @node More about locales, Unicode support under Windows, More about code pages, Microsoft Windows-Related Multilingual Issues | |
11876 @subsection More about locales | |
11877 @cindex more about locales | |
11878 | |
11879 In addition to the locale defined by the CRT, Windows (i.e. the Win32 API) | |
11880 defines various locales: | |
11881 | |
11882 @itemize @bullet | |
11883 @item | |
11884 The system-default locale is the locale defined under "Language | |
11885 settings for the system" in the "Regional Options" control panel. This | |
11886 is NOT user-specific, and changing it requires a reboot (at least under | |
11887 Windows 2000). The ANSI code page of the system-default locale is | |
11888 returned by GetACP(), and you can specify this code page in calls | |
11889 e.g. to MultiByteToWideChar with the constant CP_ACP. | |
11890 | |
11891 @item | |
11892 The user-default locale is the locale defined under "Settings for the | |
11893 current user" in the "Regional Options" control panel. | |
11894 | |
11895 @item | |
11896 There is a thread-local locale set by SetThreadLocale. #### What is this | |
11897 used for? | |
11898 @end itemize | |
11899 | |
11900 The Win32 API has a bunch of multibyte functions -- all of those that | |
11901 end with ...A(), and on which we spend so much effort in | |
11902 intl-encap-win32.c. These appear to ALWAYS use the ANSI code page of | |
11903 the system-default locale (GetACP(), CP_ACP). Note that this applies | |
11904 also, for example, to the encoding of filenames in all file-handling | |
11905 routines, including the CRT ones such as open(), because they pass their | |
11906 args unchanged to the Win32 API. | |
11907 | |
11908 @node Unicode support under Windows, The golden rules of writing Unicode-safe code, More about locales, Microsoft Windows-Related Multilingual Issues | |
11909 @subsection Unicode support under Windows | |
11910 @cindex unicode support under windows | |
11911 | |
11912 Basically, the whole concept of locales and code pages is broken, because | |
11913 it is extremely messy to support and does not allow for documents that use | |
11914 multiple languages simultaneously. Unicode was designed in response to | |
11915 this, the idea being to create a single character set that could be used to | |
11916 encode all the world's languages. Windows has supported Unicode since the | |
11917 beginning of the Win32 API. Internally, every code page has an associated | |
11918 table to convert the characters of that code page to and from Unicode, and | |
11919 the Win32 API itself probably (perhaps always) uses Unicode internally. | |
11920 | |
11921 Under Windows there are two different versions of all library routines that | |
11922 accept or return text, those that handle Unicode text and those handling | |
11923 "multibyte" text, i.e. variable-width ASCII-compatible text in some | |
11924 national format such as EUC or Shift-JIS. Because Windows 95 basically | |
11925 doesn't support Unicode but Windows NT does, and Microsoft doesn't provide | |
11926 any way of writing a single binary that will work on both systems and still | |
11927 use Unicode when it's available (although see below, Microsoft Layer for | |
11928 Unicode), we need to provide a way of run-time conditionalizing so you | |
11929 could have one binary for both systems. "Unicode-splitting" refers to | |
11930 writing code that will handle this properly. This means using | |
11931 Qmswindows_tstr as the external conversion format, calling the appropriate | |
11932 qxe...() Unicode-split version of library functions, and doing other things | |
11933 in certain cases, e.g. when a qxe() function is not present. | |
11934 | |
11935 Unicode support also requires that the various Windows API's be | |
11936 "Unicode-encapsulated", so that they automatically call the ANSI or | |
11937 Unicode version of the API call appropriately and handle the size | |
11938 differences in structures. What this means is: | |
11939 | |
11940 @itemize @bullet | |
11941 @item | |
11942 first, note that Windows already provides a sort of encapsulation | |
11943 of all API's that deal with text. All such API's are underlyingly | |
11944 provided in two versions, with an A or W suffix (ANSI or "wide" | |
11945 i.e. Unicode), and the compile-time constant UNICODE controls which is | |
11946 selected by the unsuffixed API. Same thing happens with structures, and | |
11947 also with types, where the generic types have names beginning with T -- | |
11948 TCHAR, LPTSTR, etc. Unfortunately, this is compile-time only, not | |
11949 run-time, so not sufficient. (Creating the necessary run-time encapsulation | |
11950 is not conceptually difficult, but very time-consuming to write. It | |
11951 adds no significant overhead, and the only reason it's not standard in | |
11952 Windows is conscious marketing attempts by Microsoft to cripple Windows | |
11953 95. FUCK MICROSOFT! They even describe in a KnowledgeBase article | |
11954 exactly how to create such an API [although we don't exactly follow | |
11955 their procedure], and point out its usefulness; the procedure is also | |
11956 described more generally in Nadine Kano's book on Win32 | |
11957 internationalization -- written SIX YEARS AGO! Obviously Microsoft has | |
11958 such an API available internally.) | |
11959 | |
11960 @item | |
11961 what we do is provide an encapsulation of each standard Windows API call | |
11962 that is split into A and W versions. The current theory is to avoid all | |
11963 preprocessor games; so we name the function with a prefix -- "qxe" | |
11964 currently -- and require callers to use the prefixed name. Callers need | |
11965 to explicitly use the W version of all structures, and convert text | |
11966 themselves using Qmswindows_tstr. The qxe encapsulated version will | |
11967 automatically call the appropriate A or W version depending on whether | |
11968 we're running on 9x or NT (you can force use of the A calls on NT, | |
11969 e.g. for testing purposes, using the command-line switch -nuni aka | |
11970 -no-unicode-lib-calls), and copy data between W and A versions of the | |
11971 structures as necessary. | |
11972 | |
11973 @item | |
11974 We require the caller to handle the actual translation of text to | |
11975 avoid possible overflow when dealing with fixed-size Windows | |
11976 structures. There are no such problems when copying data between | |
11977 the A and W versions because ANSI text is never larger than its | |
11978 equivalent Unicode representation. | |
11979 @end itemize | |
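
The run-time dispatch described above can be sketched in portable C.
Everything here is hypothetical: SketchExtbyte stands in for Extbyte,
sketch_LengthA() and sketch_LengthW() stand in for the A and W versions
of some Windows API call, and the flag stands in for the 9x-versus-NT
check. The point is only the shape of a wrapper that takes
already-converted text and picks the right underlying call when the
program runs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for Extbyte: externally formatted text whose
   real encoding depends on which API flavor is in use. */
typedef char SketchExtbyte;

/* Nonzero when the W (Unicode) calls should be used, as on NT. */
static int sketch_unicode_calls = 1;

void sketch_set_unicode_calls(int enabled)
{
    sketch_unicode_calls = enabled;
}

/* Stand-ins for the A and W versions of one API call. */
static size_t sketch_LengthA(const char *text)
{
    return strlen(text);
}

static size_t sketch_LengthW(const wchar_t *text)
{
    return wcslen(text);
}

/* qxe-style wrapper: the caller has already converted the text to the
   external format matching the current mode; the wrapper only chooses
   which underlying call to make. */
size_t sketch_qxeLength(const SketchExtbyte *text)
{
    assert(text != NULL);
    if (sketch_unicode_calls)
        return sketch_LengthW((const wchar_t *) text);
    return sketch_LengthA((const char *) text);
}
```

Because the choice is made by an ordinary run-time flag rather than by
the UNICODE preprocessor constant, a single binary can take either path.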
11980 | |
11981 NOTE NOTE NOTE: As of August 2001, Microsoft (finally! See my nasty | |
11982 comment above) released their own Unicode-encapsulation library, called | |
11983 Microsoft Layer for Unicode on Windows 95/98/Me Systems. It tries to be | |
11984 more transparent than we are, in that | |
11985 | |
11986 @itemize @bullet | |
11987 @item | |
11988 its routines do ANSI/Unicode string translation, while we don't, for | |
11989 efficiency (we already have to do internal/external conversion so it's | |
11990 no extra burden to do the proper conversion directly rather than always | |
11991 converting to Unicode and then doing a second conversion to ANSI as | |
11992 necessary) | |
11993 | |
11994 @item | |
11995 rather than requiring separately-named routines (qxeFooBar), they | |
11996 physically override the existing routines at the link level. It also | |
11997 appears that they do this BADLY, in that if you link with the MLU, you | |
11998 get an application that runs ONLY on Win9x!!! (hint -- use | |
11999 GetProcAddress()). There's still no way to create a single binary! | |
12000 fucking losers. | |
12001 | |
12002 @item | |
12003 they assume you compile with UNICODE defined, so there's no need for the | |
12004 application to explicitly use ...W structures, as we require. | |
12005 | |
12006 @item | |
12007 they also intercept windows procedures to deal with notify messages as | |
12008 necessary, which we don't do yet. | |
12009 | |
12010 @item | |
12011 they (of course) don't use Extbyte. | |
12012 @end itemize | |
12013 | |
12014 At some point (especially when they fix the single-binary problem!), we | |
12015 should consider switching. For the meantime, we'll stick with what I've | |
12016 already written. Perhaps we should think about adopting some of the | |
12017 greater transparency they have; but I opted against transparency on | |
12018 purpose, to make the code easier to follow for someone who's not familiar | |
12019 with it. Until our library is really complete and bug-free, we should | |
12020 think twice before doing this. | |
12021 | |
12022 According to Microsoft documentation, only the following functions are | |
12023 provided under Windows 9x to support Unicode (see MSDN page "Windows | |
12024 95/98/Me General Limitations"): | |
12025 | |
12026 EnumResourceLanguages | |
12027 EnumResourceNames | |
12028 EnumResourceTypes | |
12029 ExtTextOut | |
12030 FindResource | |
12031 FindResourceEx | |
12032 GetCharWidth | |
12033 GetCommandLine | |
12034 GetTextExtentPoint | |
12035 GetTextExtentPoint32 | |
12036 lstrcat | |
12037 lstrcpy | |
12038 lstrlen | |
12039 MessageBox | |
12040 MessageBoxEx | |
12041 MultiByteToWideChar | |
12042 TextOut | |
12043 WideCharToMultiByte | |
12044 | |
12045 also maybe GetTextExtentExPoint? (KB Q125671 "Unicode Functions Supported | |
12046 by Windows 95") | |
12047 | |
12048 However, the C runtime library provides some additional support (according | |
12049 to the CRT sources, as the docs are not very clear on this): | |
12050 | |
12051 @itemize @bullet | |
12052 @item | |
12053 wmain() is completely supported, and appropriate Unicode-formatted argv | |
12054 and envp will always be passed. | |
12055 @item | |
12056 Likewise, wWinMain() is completely supported. (NOTE: The docs are not at | |
12057 all clear on how these various entry points interact, and imply that | |
12058 a windows-subsystem program "must" use WinMain(), while a | |
12059 console-subsystem program "must" use main(), and a program compiled with UNICODE | |
12060 (which we don't, see above) "must" use the w*() versions, while a program | |
12061 not compiled this way "must" use the plain versions. In fact it appears | |
12062 that the CRT provides four different compiler entry points, namely | |
12063 w?(main|WinMain)CRTStartup, and we simply choose the one we like using | |
12064 the appropriate link flag.) | |
12065 @item | |
12066 _wenviron, _wputenv | |
12067 @end itemize | |
12068 | |
12069 NOTE: | |
12070 | |
12071 @itemize @bullet | |
12072 @item | |
12073 wsetargv.obj uses routines that were buggily left out of MSVCRT; anyway, | |
12074 from looking at the source, it does NOT correctly work under Win 9x as | |
12075 it blindly calls the Unicode version of Unicode-split API's such as | |
12076 FindFirstFile. | |
12077 | |
12078 @item | |
12079 the w*() file routines are @strong{NOT} supported -- or at least, they blindly | |
12080 call the ...W() versions of the Win32 API calls. | |
12081 @end itemize | |
12082 | |
12083 @node The golden rules of writing Unicode-safe code, The format of the locale in setlocale(), Unicode support under Windows, Microsoft Windows-Related Multilingual Issues | |
12084 @subsection The golden rules of writing Unicode-safe code | |
12085 @cindex the golden rules of writing unicode-safe code | |
12086 | |
12087 @itemize @bullet | |
12088 @item | |
12089 There are no preprocessor games going on. | |
12090 | |
12091 @item | |
12092 Do not set the UNICODE constant. | |
12093 | |
12094 @item | |
12095 You need to change your code to call the "qxe"-prefixed versions of the | |
12096 Windows API functions (when they exist) and use the ...W structs instead | |
12097 of the generic ones. String arguments in the qxe functions are of type | |
12098 Extbyte *. | |
12099 | |
12100 @item | |
12101 Your code is responsible for conversion of text arguments. We try to | |
12102 handle everything else -- the argument differences, the copying back and | |
12103 forth of structures, etc. Use Qmswindows_tstr and macros such as | |
12104 C_STRING_TO_TSTR. You are also responsible for interpreting and | |
12105 specifying string sizes, which have not been changed. Usually these are | |
12106 in characters, meaning you need to divide by XETCHAR_SIZE. (But, some | |
12107 functions want sizes in bytes, even with Unicode strings. Look in the | |
12108 documentation.) Use XETEXT when specifying string constants, so that | |
12109 they show up in Unicode as necessary. | |
12110 | |
12111 @item | |
12112 If you need to process external strings (in general you should not do | |
12113 this; do all your manipulations in internal format and convert at the | |
12114 point of entry into or exit from the function), use the xet...() | |
12115 functions. | |
12116 | |
12117 @item | |
12118 If you have to declare a fixed array to hold a string coming from | |
12119 Windows (and hence either multibyte or Unicode), declare it of type | |
12120 Extbyte[] and multiply the size by MAX_XETCHAR_SIZE. | |
12121 @end itemize | |
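
The size arithmetic in the rules above can be sketched as follows. This
assumes that XETCHAR_SIZE is 1 when the ANSI calls are in use and 2 when
the Unicode calls are in use, and that MAX_XETCHAR_SIZE is 2. The
SKETCH_ and sketch_ names are invented for this example and are not the
XEmacs macros themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed worst-case bytes per external character (one UTF-16 unit). */
#define SKETCH_MAX_TCHAR_SIZE 2

/* Assumed bytes per external character for the current mode. */
size_t sketch_tchar_size(int unicode_calls)
{
    return unicode_calls ? 2 : 1;
}

/* Bytes to reserve for a fixed array that must hold max_chars
   characters whichever flavor of the API ends up being called. */
size_t sketch_buffer_bytes(size_t max_chars)
{
    assert(max_chars <= (size_t) -1 / SKETCH_MAX_TCHAR_SIZE);
    return max_chars * SKETCH_MAX_TCHAR_SIZE;
}

/* Convert a size in bytes to the size in characters that most API
   calls expect. */
size_t sketch_size_in_chars(size_t size_in_bytes, int unicode_calls)
{
    return size_in_bytes / sketch_tchar_size(unicode_calls);
}
```

A buffer meant to hold 260 characters is therefore declared with 520
bytes, and is reported to a Unicode call as 260 characters.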
12122 | |
12123 @node The format of the locale in setlocale(), Random other Windows I18N docs, The golden rules of writing Unicode-safe code, Microsoft Windows-Related Multilingual Issues | |
12124 @subsection The format of the locale in setlocale() | |
12125 @cindex the format of the locale in setlocale() | |
12126 | |
12127 It appears that under Unix the standard format for the string in | |
12128 setlocale() involves two-letter language and country abbreviations, e.g. | |
12129 ja or ja_jp or ja_jp.euc for Japanese. Windows (MSDN article "Language | |
12130 Strings" in the run-time reference appendix, see doc list above) speaks | |
12131 of "(primary) language" and "sublanguage" (usually a country, but in the | |
12132 case of Chinese the sublanguage is "simplified" or "traditional"). It | |
12133 is highly flexible in what it takes, and thankfully it canonicalizes the | |
12134 result to a unique form "Language_Country.Encoding". It allows (note | |
12135 that all specifications can be in any case): | |
12136 | |
12137 @itemize @bullet | |
12138 @item | |
12139 the full "language_country.encoding" specification or just | |
12140 "language_country", in which case the default encoding will be chosen. | |
12141 | |
12142 @item | |
12143 a three-letter acronym, consisting of the ISO-standard two-letter | |
12144 language abbreviation followed by a third letter indicating the | |
12145 sublanguage. | |
12146 | |
12147 @item | |
12148 just a language name, e.g. "dutch", standing for the combination of | |
12149 the language with "default" as sublanguage, referring to the default | |
12150 (often "prototypical") country for that language (in this case the | |
12151 Netherlands). You can abbreviate the name by removing any number of | |
12152 letters from the end. Ambiguity is not a problem: Even specifying | |
12153 just a single letter is valid providing any language starting with | |
12154 that letter exists, but the result may not be what you want (e.g. "c" | |
12155 maps to "catalan", not "chinese", "czech", etc.). The way of | |
12156 resolving ambiguity appears fairly random -- it's not alphabetical | |
12157 ("a" maps to "arabic" not "albanian"). | |
12158 | |
12159 @item | |
12160 a combination of language and sublanguage separated by a hyphen, | |
12161 e.g. "dutch-belgian"; note that the sublanguage designator in this | |
12162 case is NOT necessarily the same as the country, e.g. "belgian" vs. | |
12163 "belgium". "dutch-belgium" (or even "dutch-belg") does @strong{NOT} get you | |
12164 the right result, but returns "Dutch_Netherlands.1252" instead! This | |
12165 is because, although you may not abbreviate the result, Windows | |
12166 accepts any unknown value in the sublanguage field and treats it as | |
12167 equivalent to "default". Note also that if the sublanguage name | |
12168 has underscores in it, you need to change them to spaces, e.g. | |
12169 "spanish-dominican republic". | |
12170 | |
12171 @item | |
12172 sometimes, just a sublanguage name, e.g. "belgian", standing for | |
12173 the combination of one of the languages spoken in that region and | |
12174 the sublanguage of the region -- in this case Dutch. Note that | |
12175 there is no guarantee of "prototypicality" in this case in choice of | |
12176 language! You could hardly say that Dutch (aka Flemish) is more | |
12177 prototypical of Belgium than French. You cannot abbreviate this | |
12178 form, if it's allowed at all. | |
12179 @end itemize | |
12180 | |
12181 In addition: | |
12182 | |
12183 @itemize @bullet | |
12184 @item | |
12185 note further that you are not limited to the language/sublanguage | |
12186 combinations predefined by Windows. You can set weird combinations | |
12187 like "Chinese_Kenya.1255" (Chinese spoken in Kenya, represented by | |
12188 Windows-1255, i.e. Hebrew!) and Windows doesn't complain, despite the | |
12189 language-encoding inconsistency. You can also make up a weird | |
12190 combination and leave out the encoding, e.g. "Chinese_Qatar", which | |
12191 maps to "Chinese_Qatar.1256", where Windows-1256 is Arabic -- i.e. it | |
12192 appears to be choosing the encoding based on a default for the | |
12193 country. | |
12194 | |
12195 @item | |
12196 note also that the names for countries are often not what you expect. | |
12197 "urdu_pakistan" fails, and just "urdu" shows why, as it maps to | |
12198 "Urdu_Islamic Republic of Pakistan.1256". That is, some countries | |
12199 exist in their full name, and the canonicalized form with underscore | |
12200 is not very forgiving in its handling of country specifications. | |
12201 Similarly, Uzbekistan is "Republic of Uzbekistan", and "China" is | |
12202 "People's Republic of China" -- but in this latter case, unlike the | |
12203 other two, just "China" works as an alias, e.g. "uzbek_china" maps | |
12204 to "Uzbek_People's Republic of China.936". | |
12205 | |
12206 @item | |
12207 note that just the two-letter ISO language code is NOT allowed. | |
12208 Sometimes you'll get lucky (e.g. "fr" does map to "france"), but | |
12209 sometimes you'll get no match (e.g. "pl"), and sometimes you'll get | |
12210 really unlucky in that the call will succeed but with the wrong | |
12211 language (e.g. "es" maps to "estonian", not "spanish"). | |
12212 @end itemize | |
12213 | |
12214 As an example, MSDN article "Language Strings" indicates that German | |
12215 (default) can be specified using "deu" or "german"; German (Austrian) | |
12216 with "dea" or "german-austrian"; German (Swiss) with "des", | |
12217 "german-swiss", or "swiss"; French (Swiss) with "french-swiss" or "frs"; | |
12218 and English (USA) with "american", "american english", | |
12219 "american-english", "english-american", "english-us", "english-usa", | |
12220 "enu", "us", or "usa". This is not, of course, an exhaustive list even | |
12221 for just the given locales -- just "english" works in practice because | |
12222 English (Default) maps to English (USA). (#### Is this always the case?) | |
12223 | |
12224 Given the canonicalization, we don't have to worry too much about the | |
12225 different kinds of inputs to setlocale() -- unlike for Unix, where no | |
12226 canonicalization is usually performed, the particular locales that | |
12227 exist vary tremendously from OS to OS, and we need to parse the | |
12228 uncanonicalized locale spec, directly from the user, to figure out the | |
12229 encoding to use, making various guesses if not enough information is | |
12230 present. Yuck! The tricky thing under Windows is figuring out how to | |
12231 deal with the sublang. It appears that the trick of simply passing the | |
12232 text of the manifest constant itself of the sublang, with appropriate | |
12233 hacking (e.g. of underscore to space), works most of the time. | |
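
Since the canonicalized result always has the form
"Language_Country.Encoding", pulling it apart is straightforward. The
following is a minimal sketch, not the XEmacs implementation: the struct
and function names are invented, and it assumes that the language part
contains no underscore and that the encoding is the decimal code page
number after the last period.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the three parts of a canonical locale. */
struct sketch_locale {
    char language[64];
    char country[64];
    int code_page;
};

/* Split "Language_Country.Encoding" at the first underscore and the
   last period. Returns 0 on success, -1 if the string does not have
   that shape or a part does not fit. */
int sketch_parse_canonical_locale(const char *spec, struct sketch_locale *out)
{
    const char *underscore, *dot;
    size_t lang_len, country_len;

    assert(spec != NULL && out != NULL);
    underscore = strchr(spec, '_');
    dot = strrchr(spec, '.');
    if (underscore == NULL || dot == NULL || dot < underscore)
        return -1;

    lang_len = (size_t) (underscore - spec);
    country_len = (size_t) (dot - underscore - 1);
    if (lang_len == 0 || lang_len >= sizeof out->language
        || country_len == 0 || country_len >= sizeof out->country)
        return -1;

    memcpy(out->language, spec, lang_len);
    out->language[lang_len] = '\0';
    memcpy(out->country, underscore + 1, country_len);
    out->country[country_len] = '\0';
    out->code_page = atoi(dot + 1);   /* 0 if no digits follow */
    return 0;
}
```

Country names with spaces, such as the one in "Urdu_Islamic Republic of
Pakistan.1256", come through intact because only the first underscore
and the last period are treated as separators.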
12234 | |
12235 @node Random other Windows I18N docs, , The format of the locale in setlocale(), Microsoft Windows-Related Multilingual Issues | |
12236 @subsection Random other Windows I18N docs | |
12237 @cindex random other windows i18n docs | |
12238 | |
12239 Introduction to Internationalization Issues in the Win32 API | |
12240 | |
12241 Abstract: This page provides an overview of the aspects of the Win32 | |
12242 internationalization API that are relevant to XEmacs, including the | |
12243 basic distinction between multibyte and Unicode encodings. Also | |
12244 included are pointers to how XEmacs should make use of this API. | |
12245 | |
12246 The Win32 API is quite well-designed in its handling of strings | |
12247 encoded for various character sets. The API is geared around the idea | |
12248 that two different methods of encoding strings should be | |
12249 supported. These methods are called multibyte and Unicode, | |
12250 respectively. The multibyte encoding is compatible with ASCII strings | |
12251 and is a more efficient representation when dealing with strings | |
12252 containing primarily ASCII characters, but it has a great number of | |
12253 serious deficiencies and limitations, including that it is very | |
12254 difficult and error-prone to work with strings in this encoding, and | |
12255 any particular string in a multibyte encoding can only contain | |
12256 characters from a very limited number of character sets. The Unicode | |
12257 encoding rectifies all of these deficiencies, but it is not compatible | |
12258 with ASCII strings (in other words, an existing program will not be | |
12259 able to handle the encoded strings unless it is explicitly modified to | |
12260 do so), and it takes up twice as much memory space as multibyte | |
12261 encodings when encoding a purely ASCII string. | |
12262 | |
12263 Multibyte encodings use a variable number of bytes (either one or two) | |
12264 to represent characters. ASCII characters are represented by a | |
12265 single byte with its high bit not set, and non-ASCII characters are | |
12266 represented by one or two bytes, the first of which always has its | |
12267 high bit set. (The second byte, when it exists, may or may not have | |
12268 its high bit set.) There is no single multibyte encoding. Instead, | |
12269 there is generally one encoding per non-ASCII character set. Such an | |
12270 encoding is capable of representing (besides ASCII characters, of | |
12271 course) only characters from one (or possibly two) particular | |
12272 character sets. | |
12273 | |
12274 Multibyte encoding makes processing of strings very difficult. For | |
12275 example, given a pointer to the beginning of a character within a | |
12276 string, finding the pointer to the beginning of the previous character | |
12277 may require backing up all the way to the beginning of the string, and | |
12278 then moving forward. Also, an operation such as separating out the | |
12279 components of a path by searching for backslashes will fail if it's | |
12280 implemented in the simplest (but not multibyte-aware) fashion, because | |
12281 it may find what appears to be a backslash, but which is actually the | |
12282 second byte of a two-byte character. Also, the limited number of | |
12283 character sets that any particular multibyte encoding can represent | |
12284 means that loss of data is likely if a string is converted from the | |
12285 XEmacs internal format into a multibyte format. | |
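
The backslash problem described above can be illustrated with a small sketch. It assumes a simplified Shift-JIS-like lead-byte rule (the helper @code{toy_is_lead_byte} and both search functions are hypothetical names, not XEmacs or Win32 functions), and it only shows why a byte-wise search gives the wrong answer; real code should use the platform's own multibyte routines.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a simplified Shift-JIS-like rule in which bytes
   0x81-0x9F and 0xE0-0xFC begin a two-byte character. */
static int toy_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive search: index of the first 0x5C byte, or -1 if none.
   It cannot tell a real backslash from a trail byte. */
static long naive_find_backslash(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
        if (s[i] == '\\')
            return (long) i;
    return -1;
}

/* Multibyte-aware search: steps over the trail byte of every
   two-byte character, so only single-byte backslashes match. */
static long mb_find_backslash(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < len) {
        if (toy_is_lead_byte(s[i]) && i + 1 < len) {
            i += 2;             /* skip lead byte and trail byte */
            continue;
        }
        if (s[i] == '\\')
            return (long) i;
        i++;
    }
    return -1;
}
```

Given the bytes @code{'a', 0x83, 0x5C, '\\', 'b'}, the naive search stops at index 2 (the trail byte of a two-byte character), while the aware search reports the real separator at index 3. Note that the aware version has to scan from the start of the string, which is the same property that makes backing up by one character expensive.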
12286 | |
12287 For these reasons, the C code in XEmacs should never do any sort of | |
12288 work with multibyte encoded strings (or with strings in any external | |
12289 encoding for that matter). Strings should always be maintained in the | |
12290 internal encoding, which is predictable, and converted to an external | |
12291 encoding only at the point where the string moves from the XEmacs C | |
12292 code and enters a system library function. Similarly, when a string is | |
12293 returned from a system library function, it should be immediately | |
12294 converted into the internal coding before any operations are done on | |
12295 it. | |
12296 | |
12297 Unicode, unlike multibyte encodings, is a fixed-width encoding where | |
12298 every character is represented using 16 bits. It is also capable of | |
12299 encoding all the characters from all the character sets in common use | |
12300 in the world. The predictability and completeness of the Unicode | |
12301 encoding makes it a very good encoding for strings that may contain | |
12302 characters from many character sets mixed up with each other. At the | |
12303 same time, of course, it is incompatible with routines that expect | |
12304 ASCII characters and also incompatible with general string | |
12305 manipulation routines, which will encounter a great number of what | |
12306 would appear to be embedded nulls in the string. It also takes twice | |
12307 as much room to encode strings containing primarily ASCII | |
12308 characters. This is why XEmacs does not use Unicode or a similar | |
12309 encoding internally for buffers. | |
12310 | |
12311 The Win32 API cleverly deals with the issue of 8 bit vs. 16 bit | |
12312 characters by declaring a type called TCHAR which specifies a generic | |
12313 character, either 8 bits or 16 bits. Generally TCHAR is defined to be | |
12314 the same as the simple C type char, unless the preprocessor constant | |
12315 UNICODE is defined, in which case TCHAR is defined to be WCHAR, which | |
12316 is a 16 bit type. Nearly all functions in the Win32 API that take | |
12317 strings are defined to take strings that are actually arrays of | |
12318 TCHARs. There is a type LPTSTR which is defined to be a string of | |
12319 TCHARs and another type LPCTSTR which is a const string of TCHARs. The | |
12320 theory is that any program that uses TCHARs exclusively to represent | |
12321 characters and does not make assumptions about the size of a TCHAR or | |
12322 the way that the characters are encoded should work transparently | |
12323 regardless of whether the UNICODE preprocessor constant is defined, | |
12324 which is to say, regardless of whether 8 bit multibyte or 16 bit | |
12325 Unicode characters are being used. The way that this is actually | |
12326 implemented is that every Win32 API function that takes a string as an | |
12327 argument actually maps to one of two functions which are suffixed with | |
12328 an A (which stands for ANSI, and means multibyte strings) or W (which | |
12329 stands for wide, and means Unicode strings). The mapping is, of | |
12330 course, controlled by the same UNICODE preprocessor | |
12331 constant. Generally all structures containing strings in them actually | |
12332 map to one of two different kinds of structures, with either an A or a | |
12333 W suffix after the structure name. | |
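
The mechanism can be imitated in portable C to make it concrete. The following is a sketch only: @code{TOY_UNICODE}, @code{toy_tchar}, @code{TOY_TEXT} and the @code{toy_length} functions are hypothetical stand-ins for @code{UNICODE}, @code{TCHAR}, the text-literal macro and an A/W function pair, and it uses @code{wchar_t}, which is not 16 bits wide on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Define TOY_UNICODE before this point to select the wide variants.
   All names here are illustrative, not part of the Win32 headers. */
#ifdef TOY_UNICODE
typedef wchar_t toy_tchar;
#define TOY_TEXT(s) L##s            /* paste an L prefix onto the literal */
#define toy_length toy_length_w     /* generic name maps to the W variant */
#else
typedef char toy_tchar;
#define TOY_TEXT(s) s
#define toy_length toy_length_a     /* generic name maps to the A variant */
#endif

/* "A" variant: operates on 8-bit (multibyte) strings. */
static size_t toy_length_a(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* "W" variant: operates on wide strings. */
static size_t toy_length_w(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* Code written only in terms of toy_tchar, TOY_TEXT and toy_length
   compiles unchanged in either mode. */
static size_t toy_title_length(void)
{
    const toy_tchar *title = TOY_TEXT("XEmacs");
    return toy_length(title);
}
```

The caller never names the A or W variant directly; the preprocessor picks one, which is why code that assumes a particular character size breaks when the constant is flipped.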
12334 | |
12335 Unfortunately, not all of the implementations of the Win32 API | |
12336 implement all of the functionality described above. In particular, | |
12337 Windows 95 does not implement very much Unicode functionality. It does | |
12338 implement functions to convert multibyte-encoded strings to and from | |
12339 Unicode strings, and provides Unicode versions of certain low-level | |
12340 functions like ExtTextOut(). In fact, all of the rest of the Unicode | |
12341 versions of API functions are just stubs that return an | |
12342 error. Conversely, all versions of Windows NT completely implement all | |
12343 the Unicode functionality, but some versions (especially versions | |
12344 before Windows NT 4.0) don't implement much of the multibyte | |
12345 functionality. For this reason, as well as for general code | |
12346 cleanliness, XEmacs needs to be written in such a way that it works | |
12347 with or without the UNICODE preprocessor constant being defined. | |
12348 | |
12349 Getting XEmacs to run when all strings are Unicode primarily involves | |
12350 removing any assumptions made about the size of characters. Recall | |
12351 from above that conversion between internally and externally | |
12352 encoded strings should occur at the point of entry into or exit | |
12353 from a library function. With this in mind, an | |
12354 externally encoded string in XEmacs can be treated simply as an | |
12355 arbitrary sequence of bytes of some length which has no particular | |
12356 relationship to the length of the string in the internal encoding. | |
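
As a rough illustration of why externally encoded data must carry its own length, the sketch below converts ASCII text to UTF-16LE by hand. The @code{toy_extdata} structure and the conversion function are hypothetical and handle ASCII input only; they are not the XEmacs conversion API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for externally encoded data: raw bytes plus an
   explicit length in bytes (not characters). */
struct toy_extdata {
    unsigned char bytes[64];
    size_t len;
};

/* Encode ASCII text as UTF-16LE.  Returns 0 on success, or -1 if the
   text does not fit or contains a non-ASCII byte. */
static int toy_ascii_to_utf16le(const char *text, struct toy_extdata *out)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);
    if (2 * n > sizeof out->bytes)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if ((unsigned char) text[i] > 0x7F)
            return -1;
        out->bytes[2 * i] = (unsigned char) text[i];  /* low byte */
        out->bytes[2 * i + 1] = 0;                    /* high byte */
    }
    out->len = 2 * n;
    return 0;
}
```

Converting @code{"hi"} yields four bytes, and calling @code{strlen()} on the result reports 1 because of the embedded zero byte. Any routine that infers the length of external data from a terminating null is therefore making an assumption about character size.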
12357 | |
12358 Use Qnative for Unix conversion, Qmswindows_tstr for Windows ... | |
12359 | |
12360 String constants that are to be passed directly to Win32 API functions, | |
12361 such as the names of window classes, need to be bracketed in their | |
12362 definition with a call to the macro XETEXT. This appropriately makes a | |
12363 string of either regular or wide chars, which is to say this string may be | |
12364 prepended with an L (causing it to be a wide string) depending on | |
12365 XEUNICODE_P. | |
12366 | |
12367 @node Modules for Internationalization, , Microsoft Windows-Related Multilingual Issues, Multilingual Support | |
12368 @section Modules for Internationalization | |
12369 @cindex modules for internationalization | |
12370 @cindex internationalization, modules for | |
12371 | |
12372 @example | |
12373 @file{mule-canna.c} | |
12374 @file{mule-ccl.c} | |
12375 @file{mule-charset.c} | |
12376 @file{mule-charset.h} | |
12377 @file{file-coding.c} | |
12378 @file{file-coding.h} | |
12379 @file{mule-coding.c} | |
12380 @file{mule-mcpath.c} | |
12381 @file{mule-mcpath.h} | |
12382 @file{mule-wnnfns.c} | |
12383 @file{mule.c} | |
12384 @end example | |
12385 | |
12386 These files implement the MULE (Asian-language) support. Note that MULE | |
12387 actually provides a general interface for all sorts of languages, not | |
12388 just Asian languages (although they are generally the most complicated | |
12389 to support). This code is still in beta. | |
12390 | |
12391 @file{mule-charset.*} and @file{file-coding.*} provide the heart of the | |
12392 XEmacs MULE support. @file{mule-charset.*} implements the @dfn{charset} | |
12393 Lisp object type, which encapsulates a character set (an ordered one- or | |
12394 two-dimensional set of characters, such as US ASCII or JISX0208 Japanese | |
12395 Kanji). | |
12396 | |
12397 @file{file-coding.*} implements the @dfn{coding-system} Lisp object | |
12398 type, which encapsulates a method of converting between different | |
12399 encodings. An encoding is a representation of a stream of characters, | |
12400 possibly from multiple character sets, using a stream of bytes or words, | |
12401 and defines (e.g.) which escape sequences are used to specify particular | |
12402 character sets, how the indices for a character are converted into bytes | |
12403 (sometimes this involves setting the high bit; sometimes complicated | |
12404 rearranging of the values takes place, as in the Shift-JIS encoding), | |
12405 etc. It also contains some generic coding system implementations, such | |
12406 as the binary (no-conversion) coding system and a sample gzip coding system. | |
12407 | |
12408 @file{mule-coding.c} contains the implementations of text coding systems. | |
12409 | |
12410 @file{mule-ccl.c} provides the CCL (Code Conversion Language) | |
12411 interpreter. CCL is similar in spirit to Lisp byte code and is used to | |
12412 implement converters for custom encodings. | |
12413 | |
12414 @file{mule-canna.c} and @file{mule-wnnfns.c} implement interfaces to | |
12415 external programs used to implement the Canna and WNN input methods, | |
12416 respectively. This code is currently in beta. | |
12417 | |
12418 @file{mule-mcpath.c} provides some functions to allow for pathnames | |
12419 containing extended characters. This code is fragmentary, obsolete, and | |
12420 completely non-working. Instead, @code{pathname-coding-system} is used | |
12421 to specify conversions of names of files and directories. The standard | |
12422 C I/O functions like @samp{open()} are wrapped so that conversion occurs | |
12423 automatically. | |
12424 | |
12425 @file{mule.c} contains a few miscellaneous things. It currently seems | |
12426 to be unused and probably should be removed. | |
12427 | |
12428 | |
12429 | |
12430 @example | |
12431 @file{intl.c} | |
12432 @end example | |
12433 | |
12434 This provides some miscellaneous internationalization code for | |
12435 implementing message translation and interfacing to the Ximp input | |
12436 method. None of this code is currently working. | |
12437 | |
12438 | |
12439 | |
12440 @example | |
12441 @file{iso-wide.h} | |
12442 @end example | |
12443 | |
12444 This contains leftover code from an earlier implementation of | |
12445 Asian-language support, and is not currently used. | |
12446 | |
12447 | |
12448 @node Consoles; Devices; Frames; Windows, The Redisplay Mechanism, Multilingual Support, Top | |
12449 @chapter Consoles; Devices; Frames; Windows | |
12450 @cindex consoles; devices; frames; windows | |
12451 @cindex devices; frames; windows, consoles; | |
12452 @cindex frames; windows, consoles; devices; | |
12453 @cindex windows, consoles; devices; frames; | |
12454 | |
12455 @menu | |
12456 * Introduction to Consoles; Devices; Frames; Windows:: | |
12457 * Point:: | |
12458 * Window Hierarchy:: | |
12459 * The Window Object:: | |
12460 * Modules for the Basic Displayable Lisp Objects:: | |
12461 @end menu | |
12462 | |
12463 @node Introduction to Consoles; Devices; Frames; Windows, Point, Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows | |
12464 @section Introduction to Consoles; Devices; Frames; Windows | |
12465 @cindex consoles; devices; frames; windows, introduction to | |
12466 @cindex devices; frames; windows, introduction to consoles; | |
12467 @cindex frames; windows, introduction to consoles; devices; | |
12468 @cindex windows, introduction to consoles; devices; frames; | |
12469 | |
12470 A window-system window that you see on the screen is called a | |
12471 @dfn{frame} in Emacs terminology. Each frame is subdivided into one or | |
12472 more non-overlapping panes, called (confusingly) @dfn{windows}. Each | |
12473 window displays the text of a buffer in it. (See above on Buffers.) Note | |
12474 that buffers and windows are independent entities: Two or more windows | |
12475 can be displaying the same buffer (potentially in different locations), | |
12476 and a buffer can be displayed in no windows. | |
12477 | |
12478 A single display screen that contains one or more frames is called | |
12479 a @dfn{display}. Under most circumstances, there is only one display. | |
12480 However, more than one display can exist, for example if you have | |
12481 a @dfn{multi-headed} console, i.e. one with a single keyboard but | |
12482 multiple displays. (Typically in such a situation, the various | |
12483 displays act like one large display, in that the mouse is only | |
12484 in one of them at a time, and moving the mouse off of one moves | |
12485 it into another.) In some cases, the different displays will | |
12486 have different characteristics, e.g. one color and one mono. | |
12487 | |
12488 XEmacs can display frames on multiple displays. It can even deal | |
12489 simultaneously with frames on multiple keyboards (called @dfn{consoles} in | |
12490 XEmacs terminology). Here is one case where this might be useful: You | |
12491 are using XEmacs on your workstation at work, and leave it running. | |
12492 Then you go home and dial in on a TTY line, and you can use the | |
12493 already-running XEmacs process to display another frame on your local | |
12494 TTY. | |
12495 | |
12496 Thus, there is a hierarchy console -> display (device) -> frame -> window. | |
12497 There is a separate Lisp object type for each of these four concepts. | |
12498 Furthermore, there is logically a @dfn{selected console}, | |
12499 @dfn{selected display}, @dfn{selected frame}, and @dfn{selected window}. | |
12500 Each of these objects is distinguished in various ways, such as being the | |
12501 default object for various functions that act on objects of that type. | |
12502 Note that every containing object remembers the ``selected'' object | |
12503 among the objects that it contains: e.g. not only is there a selected | |
12504 window, but every frame remembers the last window in it that was | |
12505 selected, and changing the selected frame causes the remembered window | |
12506 within it to become the selected window. Similar relationships apply | |
12507 for consoles to devices and devices to frames. | |
12508 | |
12509 @node Point, Window Hierarchy, Introduction to Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows | |
12510 @section Point | |
12511 @cindex point | |
12512 | |
12513 Recall that every buffer has a current insertion position, called | |
12514 @dfn{point}. Now, two or more windows may be displaying the same buffer, | |
12515 and the text cursor in the two windows (i.e. @code{point}) can be in | |
12516 two different places. You may ask, how can that be, since each | |
12517 buffer has only one value of @code{point}? The answer is that each window | |
12518 also has a value of @code{point} that is squirreled away in it. There | |
12519 is only one selected window, and the value of ``point'' in its buffer | |
12520 corresponds to that window. When the selected window is changed | |
12521 from one window to another displaying the same buffer, the old | |
12522 value of @code{point} is stored into the old window's ``point'' and the | |
12523 value of @code{point} from the new window is retrieved and made the | |
12524 value of @code{point} in the buffer. This means that @code{window-point} | |
12525 for the selected window is potentially inaccurate, and if you | |
12526 want to retrieve the correct value of @code{point} for a window, | |
12527 you must special-case on the selected window and retrieve the | |
12528 buffer's point instead. This is related to why @code{save-window-excursion} | |
12529 does not save the selected window's value of @code{point}. | |
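
The bookkeeping described above can be modelled in a few lines of C. This is a toy model under simplifying assumptions (integer positions instead of markers, and hypothetical names such as @code{toy_select_window}); it is not the actual code in @file{window.c}.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins: a buffer holds the live value of point; each window
   holds a stashed copy (pointm) that is only current when the window
   is not selected. */
struct toy_buf { int point; };
struct toy_win { struct toy_buf *buffer; int pointm; };

static struct toy_win *toy_selected_window;

/* Selecting a window stashes the buffer's point into the old window
   and loads the new window's stashed point into its buffer. */
static void toy_select_window(struct toy_win *w)
{
    assert(w != NULL && w->buffer != NULL);
    if (toy_selected_window != NULL)
        toy_selected_window->pointm = toy_selected_window->buffer->point;
    toy_selected_window = w;
    w->buffer->point = w->pointm;
}

/* Correct retrieval special-cases the selected window, whose stashed
   value may be stale. */
static int toy_window_point(const struct toy_win *w)
{
    assert(w != NULL && w->buffer != NULL);
    return w == toy_selected_window ? w->buffer->point : w->pointm;
}
```

After the selected window's buffer point moves, its @code{pointm} still holds the old value until another window is selected, which is the inaccuracy the paragraph above warns about.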
12530 | |
12531 @node Window Hierarchy, The Window Object, Point, Consoles; Devices; Frames; Windows | |
12532 @section Window Hierarchy | |
12533 @cindex window hierarchy | |
12534 @cindex hierarchy of windows | |
12535 | |
12536 If a frame contains multiple windows (panes), they are always created | |
12537 by splitting an existing window along the horizontal or vertical axis. | |
12538 Terminology is a bit confusing here: to @dfn{split a window | |
12539 horizontally} means to create two side-by-side windows, i.e. to make a | |
12540 @emph{vertical} cut in a window. Likewise, to @dfn{split a window | |
12541 vertically} means to create two windows, one above the other, by making | |
12542 a @emph{horizontal} cut. | |
12543 | |
12544 If you split a window and then split again along the same axis, you | |
12545 will end up with a number of panes all arranged along the same axis. | |
12546 The precise way in which the splits were made should not be important, | |
12547 and this is reflected internally. Internally, all windows are arranged | |
12548 in a tree, consisting of two types of windows, @dfn{combination} windows | |
12549 (which have children, and are covered completely by those children) and | |
12550 @dfn{leaf} windows, which have no children and are visible. Every | |
12551 combination window has two or more children, all arranged along the same | |
12552 axis. There are (logically) two subtypes of combination windows, depending on | |
12553 whether their children are horizontally or vertically arrayed. There is | |
12554 always one root window, which is either a leaf window (if the frame | |
12555 contains only one window) or a combination window (if the frame contains | |
12556 more than one window). In the latter case, the root window will have | |
12557 two or more children, either horizontally or vertically arrayed, and | |
12558 each of those children will be either a leaf window or another | |
12559 combination window. | |
12560 | |
12561 Here are some rules: | |
12562 | |
12563 @enumerate | |
12564 @item | |
12565 Horizontal combination windows can never have children that are | |
12566 horizontal combination windows; same for vertical. | |
12567 | |
12568 @item | |
12569 Only leaf windows can be split (obviously) and this splitting does one | |
12570 of two things: (a) turns the leaf window into a combination window and | |
12571 creates two new leaf children, or (b) turns the leaf window into one of | |
12572 the two new leaves and creates the other leaf. Rule (1) dictates which | |
12573 of these two outcomes happens. | |
12574 | |
12575 @item | |
12576 Every combination window must have at least two children. | |
12577 | |
12578 @item | |
12579 Leaf windows can never become combination windows. They can be deleted, | |
12580 however. If this results in a violation of (3), the parent combination | |
12581 window also gets deleted. | |
12582 | |
12583 @item | |
12584 All functions that accept windows must be prepared to accept combination | |
12585 windows, and do something sane (e.g. signal an error if so). | |
12586 Combination windows @emph{do} escape to the Lisp level. | |
12587 | |
12588 @item | |
12589 All windows have three fields governing their contents: | |
12590 these are @dfn{hchild} (a list of horizontally-arrayed children), | |
12591 @dfn{vchild} (a list of vertically-arrayed children), and @dfn{buffer} | |
12592 (the buffer contained in a leaf window). Exactly one of | |
12593 these will be non-@code{nil}. Remember that @dfn{horizontally-arrayed} | |
12594 means ``side-by-side'' and @dfn{vertically-arrayed} means | |
12595 ``one above the other''. | |
12596 | |
12597 @item | |
12598 Leaf windows also have markers in their @code{start} (the | |
12599 first buffer position displayed in the window) and @code{pointm} | |
12600 (the window's stashed value of @code{point}---see above) fields, | |
12601 while combination windows have @code{nil} in these fields. | |
12602 | |
12603 @item | |
12604 The list of children for a window is threaded through the | |
12605 @code{next} and @code{prev} fields of each child window. | |
12606 | |
12607 @item | |
12608 @strong{Deleted windows can be undeleted}. This happens as a result of | |
12609 restoring a window configuration, and is unlike frames, displays, and | |
12610 consoles, which, once deleted, can never be restored. Deleting a window | |
12611 does nothing except set a special @code{dead} bit to 1 and clear out the | |
12612 @code{next}, @code{prev}, @code{hchild}, and @code{vchild} fields, for | |
12613 GC purposes. | |
12614 | |
12615 @item | |
12616 Most frames actually have two top-level windows---one for the | |
12617 minibuffer and one (the @dfn{root}) for everything else. The modeline | |
12618 (if present) separates these two. The @code{next} field of the root | |
12619 points to the minibuffer, and the @code{prev} field of the minibuffer | |
12620 points to the root. The other @code{next} and @code{prev} fields are | |
12621 @code{nil}, and the frame points to both of these windows. | |
12622 Minibuffer-less frames have no minibuffer window, and the @code{next} | |
12623 and @code{prev} of the root window are @code{nil}. Minibuffer-only | |
12624 frames have no root window, and the @code{next} of the minibuffer window | |
12625 is @code{nil} but the @code{prev} points to itself. (#### This is an | |
12626 artifact that should be fixed.) | |
12627 @end enumerate | |
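
Several of the rules above (1, 3 and 6) are structural invariants that can be checked mechanically. The following sketch uses a hypothetical, pared-down structure (@code{toy_tree_window}, with a string standing in for the buffer object) rather than the real window structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced window: exactly one of hchild, vchild and
   buffer should be non-null; siblings are threaded through next/prev. */
struct toy_tree_window {
    struct toy_tree_window *hchild;   /* first side-by-side child */
    struct toy_tree_window *vchild;   /* first stacked child */
    const char *buffer;               /* stand-in for a buffer object */
    struct toy_tree_window *next, *prev;
};

/* Returns 1 if the subtree rooted at w satisfies rules 1, 3 and 6,
   and 0 otherwise. */
static int toy_window_tree_ok(const struct toy_tree_window *w)
{
    assert(w != NULL);
    int set = (w->hchild != NULL) + (w->vchild != NULL) + (w->buffer != NULL);
    if (set != 1)
        return 0;                     /* rule 6: exactly one is set */
    if (w->buffer != NULL)
        return 1;                     /* leaf window */

    const struct toy_tree_window *first = w->hchild ? w->hchild : w->vchild;
    int count = 0;
    for (const struct toy_tree_window *c = first; c != NULL; c = c->next) {
        if (w->hchild && c->hchild)
            return 0;                 /* rule 1: no nested horizontal */
        if (w->vchild && c->vchild)
            return 0;                 /* rule 1: no nested vertical */
        if (!toy_window_tree_ok(c))
            return 0;
        count++;
    }
    return count >= 2;                /* rule 3: at least two children */
}
```

A root with two leaf children linked through @code{next} passes the check; unlinking one child leaves a combination window with a single child, which violates rule 3.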
12628 | |
12629 @node The Window Object, Modules for the Basic Displayable Lisp Objects, Window Hierarchy, Consoles; Devices; Frames; Windows | |
12630 @section The Window Object | |
12631 @cindex window object, the | |
12632 @cindex object, the window | |
12633 | |
12634 Windows have the following accessible fields: | |
12635 | |
12636 @table @code | |
12637 @item frame | |
12638 The frame that this window is on. | |
12639 | |
12640 @item mini_p | |
12641 Non-@code{nil} if this window is a minibuffer window. | |
12642 | |
12643 @item buffer | |
12644 The buffer that the window is displaying. This may change often during | |
12645 the life of the window. | |
12646 | |
12647 @item dedicated | |
12648 Non-@code{nil} if this window is dedicated to its buffer. | |
12649 | |
12650 @item pointm | |
12651 @cindex window point internals | |
12652 This is the value of point in the current buffer when this window is | |
12653 selected; when it is not selected, it retains its previous value. | |
12654 | |
12655 @item start | |
12656 The position in the buffer that is the first character to be displayed | |
12657 in the window. | |
12658 | |
12659 @item force_start | |
12660 If this flag is non-@code{nil}, it says that the window has been | |
12661 scrolled explicitly by the Lisp program. This affects what the next | |
12662 redisplay does if point is off the screen: instead of scrolling the | |
12663 window to show the text around point, it moves point to a location that | |
12664 is on the screen. | |
12665 | |
12666 @item last_modified | |
12667 The @code{modified} field of the window's buffer, as of the last time | |
12668 a redisplay completed in this window. | |
12669 | |
12670 @item last_point | |
12671 The buffer's value of point, as of the last time | |
12672 a redisplay completed in this window. | |
12673 | |
12674 @item left | |
12675 This is the left-hand edge of the window, measured in columns. (The | |
12676 leftmost column on the screen is @w{column 0}.) | |
12677 | |
12678 @item top | |
12679 This is the top edge of the window, measured in lines. (The top line on | |
12680 the screen is @w{line 0}.) | |
12681 | |
12682 @item height | |
12683 The height of the window, measured in lines. | |
12684 | |
12685 @item width | |
12686 The width of the window, measured in columns. | |
12687 | |
12688 @item next | |
12689 This is the window that is the next in the chain of siblings. It is | |
12690 @code{nil} in a window that is the rightmost or bottommost of a group of | |
12691 siblings. | |
12692 | |
12693 @item prev | |
12694 This is the window that is the previous in the chain of siblings. It is | |
12695 @code{nil} in a window that is the leftmost or topmost of a group of | |
12696 siblings. | |
12697 | |
12698 @item parent | |
12699 Internally, XEmacs arranges windows in a tree; each group of siblings has | |
12700 a parent window whose area includes all the siblings. This field points | |
12701 to a window's parent. | |
12702 | |
12703 Parent windows do not display buffers, and play little role in display | |
12704 except to shape their child windows. Emacs Lisp programs usually have | |
12705 no access to the parent windows; they operate on the windows at the | |
12706 leaves of the tree, which actually display buffers. | |
12707 | |
12708 @item hscroll | |
12709 This is the number of columns that the display in the window is scrolled | |
12710 horizontally to the left. Normally, this is 0. | |
12711 | |
12712 @item use_time | |
12713 This is the last time that the window was selected. The function | |
12714 @code{get-lru-window} uses this field. | |
12715 | |
12716 @item display_table | |
12717 The window's display table, or @code{nil} if none is specified for it. | |
12718 | |
12719 @item update_mode_line | |
12720 Non-@code{nil} means this window's mode line needs to be updated. | |
12721 | |
12722 @item base_line_number | |
12723 The line number of a certain position in the buffer, or @code{nil}. | |
12724 This is used for displaying the line number of point in the mode line. | |
12725 | |
12726 @item base_line_pos | |
12727 The position in the buffer for which the line number is known, or | |
12728 @code{nil} meaning none is known. | |
12729 | |
12730 @item region_showing | |
12731 If the region (or part of it) is highlighted in this window, this field | |
12732 holds the mark position that made one end of that region. Otherwise, | |
12733 this field is @code{nil}. | |
12734 @end table | |
12735 | |
12736 @node Modules for the Basic Displayable Lisp Objects, , The Window Object, Consoles; Devices; Frames; Windows | |
12737 @section Modules for the Basic Displayable Lisp Objects | |
12738 @cindex modules for the basic displayable Lisp objects | |
12739 @cindex displayable Lisp objects, modules for the basic | |
12740 @cindex Lisp objects, modules for the basic displayable | |
12741 @cindex objects, modules for the basic displayable Lisp | |
12742 | |
12743 @example | |
12744 @file{console-msw.c} | |
12745 @file{console-msw.h} | |
12746 @file{console-stream.c} | |
12747 @file{console-stream.h} | |
12748 @file{console-tty.c} | |
12749 @file{console-tty.h} | |
12750 @file{console-x.c} | |
12751 @file{console-x.h} | |
12752 @file{console.c} | |
12753 @file{console.h} | |
12754 @end example | |
12755 | |
12756 These modules implement the @dfn{console} Lisp object type. A console | |
12757 contains multiple display devices, but only one keyboard and mouse. | |
12758 Most of the time, a console will contain exactly one device. | |
12759 | |
12760 Consoles are the top of a Lisp object inclusion hierarchy. Consoles | |
12761 contain devices, which contain frames, which contain windows. | |
12762 | |
12763 | |
12764 | |
12765 @example | |
12766 @file{device-msw.c} | |
12767 @file{device-tty.c} | |
12768 @file{device-x.c} | |
12769 @file{device.c} | |
12770 @file{device.h} | |
12771 @end example | |
12772 | |
12773 These modules implement the @dfn{device} Lisp object type. This | |
12774 abstracts a particular screen or connection on which frames are | |
12775 displayed. As with Lisp objects, event interfaces, and other | |
12776 subsystems, the device code is separated into a generic component that | |
12777 contains a standardized interface (in the form of a set of methods) onto | |
12778 particular device types. | |
12779 | |
12780 The device subsystem defines all the methods and provides method | |
12781 services for not only device operations but also for the frame, window, | |
12782 menubar, scrollbar, toolbar, and other displayable-object subsystems. | |
12783 The reason for this is that all of these subsystems have the same | |
12784 subtypes (X, TTY, NeXTstep, Microsoft Windows, etc.) as devices do. | |
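
The split between a generic component and per-type methods is essentially a table of function pointers. The sketch below shows the general pattern only; the structure layout, the method name and the two toy device types are assumptions made for illustration and do not reproduce the definitions in @file{device.h} or the console headers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_device;

/* Hypothetical method table: one instance per device type. */
struct toy_device_methods {
    const char *type_name;
    int (*frame_default_width)(const struct toy_device *d);
};

struct toy_device {
    const struct toy_device_methods *methods;
    int columns;                      /* only meaningful for the toy tty */
};

/* Type-specific implementations. */
static int toy_tty_width(const struct toy_device *d) { return d->columns; }
static int toy_gui_width(const struct toy_device *d) { (void) d; return 80; }

static const struct toy_device_methods toy_tty_methods = { "tty", toy_tty_width };
static const struct toy_device_methods toy_gui_methods = { "gui", toy_gui_width };

/* Generic entry point: callers never test the device type themselves;
   they dispatch through the method table. */
static int toy_frame_default_width(const struct toy_device *d)
{
    assert(d != NULL && d->methods != NULL);
    return d->methods->frame_default_width(d);
}
```

Because frames, windows, scrollbars and the like share the same set of subtypes, keeping their methods in one per-type table avoids a parallel dispatch mechanism for each subsystem.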
12785 | |
12786 | |
12787 | |
12788 @example | |
12789 @file{frame-msw.c} | |
12790 @file{frame-tty.c} | |
12791 @file{frame-x.c} | |
12792 @file{frame.c} | |
12793 @file{frame.h} | |
12794 @end example | |
12795 | |
12796 Each device contains one or more frames in which objects (e.g. text) are | |
12797 displayed. A frame corresponds to a window in the window system; | |
12798 usually this is a top-level window but it could potentially be one of a | |
12799 number of overlapping child windows within a top-level window, using the | |
12800 MDI (Multiple Document Interface) protocol in Microsoft Windows or a | |
12801 similar scheme. | |
12802 | |
12803 The @file{frame-*} files implement the @dfn{frame} Lisp object type and | |
12804 provide the generic and device-type-specific operations on frames | |
12805 (e.g. raising, lowering, resizing, moving, etc.). | |
12806 | |
12807 | |
12808 | |
12809 @example | |
12810 @file{window.c} | |
12811 @file{window.h} | |
12812 @end example | |
12813 | |
12814 @cindex window (in Emacs) | |
12815 @cindex pane | |
12816 Each frame consists of one or more non-overlapping @dfn{windows} (better | |
12817 known as @dfn{panes} in standard window-system terminology) in which a | |
12818 buffer's text can be displayed. Windows can also have scrollbars | |
12819 displayed around their edges. | |
12820 | |
12821 @file{window.c} and @file{window.h} implement the @dfn{window} Lisp | |
12822 object type and provide code to manage windows. Since windows have no | |
12823 associated resources in the window system (the window system knows only | |
12824 about the frame; no child windows or anything are used for XEmacs | |
12825 windows), there is no device-type-specific code here; all of that code | |
12826 is part of the redisplay mechanism or the code for particular object | |
12827 types such as scrollbars. | |
12828 | |
12829 @node The Redisplay Mechanism, Extents, Consoles; Devices; Frames; Windows, Top | |
12830 @chapter The Redisplay Mechanism | |
12831 @cindex redisplay mechanism, the | |
12832 | |
12833 The redisplay mechanism is one of the most complicated sections of | |
12834 XEmacs, especially from a conceptual standpoint. This is doubly so | |
12835 because, unlike for the basic aspects of the Lisp interpreter, the | |
12836 computer science theories of how to efficiently handle redisplay are not | |
12837 well-developed. | |
12838 | |
12839 When working with the redisplay mechanism, remember the Golden Rules | |
12840 of Redisplay: | |
12841 | |
12842 @enumerate | |
12843 @item | |
12844 It Is Better To Be Correct Than Fast. | |
12845 @item | |
12846 Thou Shalt Not Run Elisp From Within Redisplay. | |
12847 @item | |
12848 It Is Better To Be Fast Than Not To Be. | |
12849 @end enumerate | |
12850 | |
12851 @menu | |
12852 * Critical Redisplay Sections:: | |
12853 * Line Start Cache:: | |
12854 * Redisplay Piece by Piece:: | |
12855 * Modules for the Redisplay Mechanism:: | |
12856 * Modules for other Display-Related Lisp Objects:: | |
12857 @end menu | |
12858 | |
12859 @node Critical Redisplay Sections, Line Start Cache, The Redisplay Mechanism, The Redisplay Mechanism | |
12860 @section Critical Redisplay Sections | |
12861 @cindex redisplay sections, critical | |
12862 @cindex critical redisplay sections | |
12863 | |
12864 Within this section, we are defenseless and assume that the | |
12865 following cannot happen: | |
12866 | |
12867 @enumerate | |
12868 @item | |
12869 garbage collection | |
12870 @item | |
12871 Lisp code evaluation | |
12872 @item | |
12873 frame size changes | |
12874 @end enumerate | |
12875 | |
12876 We ensure (3) by calling @code{hold_frame_size_changes()}, which | |
12877 will cause any pending frame size changes to get put on hold | |
12878 till after the end of the critical section. (1) follows | |
12879 automatically if (2) is met. #### Unfortunately, there are | |
12880 some places where Lisp code can be called within this section. | |
12881 We need to remove them. | |
12882 | |
12883 If @code{Fsignal()} is called during this critical section, we | |
12884 will @code{abort()}. | |
12885 | |
12886 If garbage collection is called during this critical section, | |
12887 we simply return. #### We should abort instead. | |
12888 | |
12889 #### If a frame-size change does occur we should probably | |
12890 actually be preempting redisplay. | |
12891 | |
12892 @node Line Start Cache, Redisplay Piece by Piece, Critical Redisplay Sections, The Redisplay Mechanism | |
12893 @section Line Start Cache | |
12894 @cindex line start cache | |
12895 | |
12896 The traditional scrolling code in Emacs breaks in a variable height | |
12897 world. It depends on the key assumption that the number of lines that | |
12898 can be displayed at any given time is fixed. This led to a complete | |
12899 separation of the scrolling code from the redisplay code. In order to | |
12900 fully support variable height lines, the scrolling code must actually be | |
12901 tightly integrated with redisplay. Only redisplay can determine how | |
12902 many lines will be displayed on a screen for any given starting point. | |
12903 | |
12904 What is ideally wanted is a complete list of the starting buffer | |
12905 position for every possible display line of a buffer along with the | |
12906 height of that display line. Maintaining such a full list would be very | |
12907 expensive. We settle for having it include information for all areas | |
12908 which we happen to generate anyhow (i.e. the region currently being | |
12909 displayed) and for those areas we need to work with. | |
12910 | |
12911 In order to ensure that the cache accurately represents what redisplay | |
12912 would actually show, it is necessary to invalidate it in many | |
12913 situations. If the buffer changes, the starting positions may no longer | |
12914 be correct. If a face or an extent has changed then the line heights | |
12915 may have altered. These events happen frequently enough that the cache | |
12916 can end up being constantly disabled. With this potentially constant | |
12917 invalidation, when is the cache ever useful? | |
12918 | |
12919 Even if the cache is invalidated before every single usage, it is | |
12920 necessary. Scrolling often requires knowledge about display lines which | |
12921 are actually above or below the visible region. The cache provides a | |
12922 convenient light-weight method of storing this information for multiple | |
12923 display regions. This knowledge is necessary for the scrolling code to | |
12924 always obey the First Golden Rule of Redisplay. | |
12925 | |
12926 If the cache already contains all of the information that the scrolling | |
12927 routines happen to need so that it doesn't have to go generate it, then | |
12928 we are able to obey the Third Golden Rule of Redisplay. The first thing | |
12929 we do to help out the cache is to always add the displayed region. This | |
12930 region had to be generated anyway, so the cache ends up getting the | |
12931 information basically for free. In those cases where a user is simply | |
12932 scrolling around viewing a buffer there is a high probability that this | |
12933 is sufficient to always provide the needed information. The second | |
12934 thing we can do is be smart about invalidating the cache. | |
12935 | |
12936 TODO---Be smart about invalidating the cache. Potential places: | |
12937 | |
12938 @itemize @bullet | |
12939 @item | |
12940 Insertions at end-of-line which don't cause line-wraps do not alter the | |
12941 starting positions of any display lines. These types of buffer | |
12942 modifications should not invalidate the cache. This is actually a large | |
12943 optimization for redisplay speed as well. | |
12944 @item | |
12945 Buffer modifications frequently only affect the display of lines at and | |
12946 below where they occur. In these situations we should only invalidate | |
12947 the part of the cache starting at where the modification occurs. | |
12948 @end itemize | |
12949 | |
12950 In case you're wondering, the Second Golden Rule of Redisplay is not | |
12951 applicable. | |
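
As a rough illustration of the second TODO item above, the following C sketch keeps a small array of cached line starts and discards only the entries at and below a modification. The structure layout, the fixed capacity and the function names (@code{lsc_invalidate_from}, @code{lsc_find}) are assumptions made for this example and are not taken from the XEmacs sources.

```c
#include <assert.h>
#include <stddef.h>

#define LSC_MAX 64   /* arbitrary capacity chosen for the sketch */

/* Hypothetical cache entry: one display line. */
struct line_start {
  long start;   /* buffer position of the first character on the line */
  long end;     /* buffer position of the last character on the line */
  int height;   /* pixel height of the display line */
};

/* Entries are assumed contiguous and sorted by increasing start. */
struct line_start_cache {
  struct line_start lines[LSC_MAX];
  size_t count;
};

/* Keep only the cached lines that end before MODPOS; the line that
   contains the modification and every line below it are dropped. */
void lsc_invalidate_from(struct line_start_cache *c, long modpos)
{
  size_t keep = 0;
  assert(c->count <= LSC_MAX);
  while (keep < c->count && c->lines[keep].end < modpos)
    keep++;
  c->count = keep;
}

/* Return the index of the cached line containing POS, or -1 if that
   line is not in the cache and would have to be regenerated. */
long lsc_find(const struct line_start_cache *c, long pos)
{
  size_t i;
  for (i = 0; i < c->count; i++)
    if (c->lines[i].start <= pos && pos <= c->lines[i].end)
      return (long) i;
  return -1;
}
```

Lines above the modification stay usable, so a scroll that only needs those lines would not have to regenerate them.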
12952 | |
12953 @node Redisplay Piece by Piece, Modules for the Redisplay Mechanism, Line Start Cache, The Redisplay Mechanism | |
12954 @section Redisplay Piece by Piece | |
12955 @cindex redisplay piece by piece | |
12956 | |
12957 As you can begin to see, redisplay is complex and also not well | |
12958 documented. Chuck no longer works on XEmacs so this section is my take | |
12959 on the workings of redisplay. | |
12960 | |
12961 Redisplay happens in three phases: | |
12962 | |
12963 @enumerate | |
12964 @item | |
12965 Determine the desired display in the area that needs redisplay. | |
12966 Implemented by @code{redisplay.c}. | |
12967 @item | |
12968 Compare the desired display with the current display. | |
12969 Implemented by @code{redisplay-output.c}. | |
12970 @item | |
12971 Output the changes. Implemented by @code{redisplay-output.c}, | |
12972 @code{redisplay-x.c}, @code{redisplay-msw.c} and @code{redisplay-tty.c}. | |
12973 @end enumerate | |
12974 | |
12975 Steps 1 and 2 are device-independent and relatively complex. Step 3 is | |
12976 mostly device-dependent. | |
12977 | |
12978 Determining the desired display | |
12979 | |
12980 Display attributes are stored in @code{display_line} structures. Each | |
12981 @code{display_line} consists of a set of @code{display_block}'s and each | |
12982 @code{display_block} contains a number of @code{rune}'s. Generally | |
12983 dynarr's of @code{display_line}'s are held by each window representing | |
12984 the current display and the desired display. | |
12985 | |
12986 The @code{display_line} structures are tightly tied to buffers which | |
12987 presents a problem for redisplay as this connection is bogus for the | |
12988 modeline. Hence the @code{display_line} generation routines are | |
12989 duplicated for generating the modeline. This means that the modeline | |
12990 display code has many bugs that the standard redisplay code does not. | |
12991 | |
12992 The guts of @code{display_line} generation are in | |
12993 @code{create_text_block}, which creates a single display line for the | |
12994 desired locale. This incrementally parses the characters on the current | |
12995 line and generates redisplay structures for each. | |
12996 | |
12997 Gutter redisplay is different. Because the data to display is stored in | |
12998 a string we cannot use @code{create_text_block}. Instead we use | |
12999 @code{create_text_string_block} which performs the same function as | |
13000 @code{create_text_block} but for strings. Many of the complexities of | |
13001 @code{create_text_block} to do with cursor handling and selective | |
13002 display have been removed. | |
13003 | |
13004 @node Modules for the Redisplay Mechanism, Modules for other Display-Related Lisp Objects, Redisplay Piece by Piece, The Redisplay Mechanism | |
13005 @section Modules for the Redisplay Mechanism | |
13006 @cindex modules for the redisplay mechanism | |
13007 @cindex redisplay mechanism, modules for the | |
13008 | |
13009 @example | |
13010 @file{redisplay-output.c} | |
13011 @file{redisplay-msw.c} | |
13012 @file{redisplay-tty.c} | |
13013 @file{redisplay-x.c} | |
13014 @file{redisplay.c} | |
13015 @file{redisplay.h} | |
13016 @end example | |
13017 | |
13018 These files provide the redisplay mechanism. As with many other | |
13019 subsystems in XEmacs, there is a clean separation between the general | |
13020 and device-specific support. | |
13021 | |
13022 @file{redisplay.c} contains the bulk of the redisplay engine. These | |
13023 functions update the redisplay structures (which describe how the screen | |
13024 is to appear) to reflect any changes made to the state of any | |
13025 displayable objects (buffer, frame, window, etc.) since the last time | |
13026 that redisplay was called. These functions are highly optimized to | |
13027 avoid doing more work than necessary (since redisplay is called | |
13028 extremely often and is potentially a huge time sink), and depend heavily | |
13029 on notifications from the objects themselves that changes have occurred, | |
13030 so that redisplay doesn't explicitly have to check each possible object. | |
13031 The redisplay mechanism also contains a great deal of caching to further | |
13032 speed things up; some of this caching is contained within the various | |
13033 displayable objects. | |
13034 | |
13035 @file{redisplay-output.c} goes through the redisplay structures and converts | |
13036 them into calls to device-specific methods to actually output the screen | |
13037 changes. | |
13038 | |
13039 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c} | |
13040 are implementations of these redisplay output methods, for X frames, | |
13041 MS Windows frames and TTY frames, respectively. | |
13042 | |
13043 | |
13044 | |
13045 @example | |
13046 @file{indent.c} | |
13047 @end example | |
13048 | |
13049 This module contains various functions and Lisp primitives for | |
13050 converting between buffer positions and screen positions. These | |
13051 functions call the redisplay mechanism to do most of the work, and then | |
13052 examine the redisplay structures to get the necessary information. This | |
13053 module needs work. | |
13054 | |
13055 | |
13056 | |
13057 @example | |
13058 @file{termcap.c} | |
13059 @file{terminfo.c} | |
13060 @file{tparam.c} | |
13061 @end example | |
13062 | |
13063 These files contain functions for working with the termcap (BSD-style) | |
13064 and terminfo (System V style) databases of terminal capabilities and | |
13065 escape sequences, used when XEmacs is displaying in a TTY. | |
13066 | |
13067 | |
13068 | |
13069 @example | |
13070 @file{cm.c} | |
13071 @file{cm.h} | |
13072 @end example | |
13073 | |
13074 These files provide some miscellaneous TTY-output functions and should | |
13075 probably be merged into @file{redisplay-tty.c}. | |
13076 | |
13077 | |
13078 | |
13079 @node Modules for other Display-Related Lisp Objects, , Modules for the Redisplay Mechanism, The Redisplay Mechanism | |
13080 @section Modules for other Display-Related Lisp Objects | |
13081 @cindex modules for other display-related Lisp objects | |
13082 @cindex display-related Lisp objects, modules for other | |
13083 @cindex Lisp objects, modules for other display-related | |
13084 | |
13085 @example | |
13086 @file{faces.c} | |
13087 @file{faces.h} | |
13088 @end example | |
13089 | |
13090 | |
13091 | |
13092 @example | |
13093 @file{bitmaps.h} | |
13094 @file{glyphs-eimage.c} | |
13095 @file{glyphs-msw.c} | |
13096 @file{glyphs-msw.h} | |
13097 @file{glyphs-widget.c} | |
13098 @file{glyphs-x.c} | |
13099 @file{glyphs-x.h} | |
13100 @file{glyphs.c} | |
13101 @file{glyphs.h} | |
13102 @end example | |
13103 | |
13104 | |
13105 | |
13106 @example | |
13107 @file{objects-msw.c} | |
13108 @file{objects-msw.h} | |
13109 @file{objects-tty.c} | |
13110 @file{objects-tty.h} | |
13111 @file{objects-x.c} | |
13112 @file{objects-x.h} | |
13113 @file{objects.c} | |
13114 @file{objects.h} | |
13115 @end example | |
13116 | |
13117 | |
13118 | |
13119 @example | |
13120 @file{menubar-msw.c} | |
13121 @file{menubar-msw.h} | |
13122 @file{menubar-x.c} | |
13123 @file{menubar.c} | |
13124 @file{menubar.h} | |
13125 @end example | |
13126 | |
13127 | |
13128 | |
13129 @example | |
13130 @file{scrollbar-msw.c} | |
13131 @file{scrollbar-msw.h} | |
13132 @file{scrollbar-x.c} | |
13133 @file{scrollbar-x.h} | |
13134 @file{scrollbar.c} | |
13135 @file{scrollbar.h} | |
13136 @end example | |
13137 | |
13138 | |
13139 | |
13140 @example | |
13141 @file{toolbar-msw.c} | |
13142 @file{toolbar-x.c} | |
13143 @file{toolbar.c} | |
13144 @file{toolbar.h} | |
13145 @end example | |
13146 | |
13147 | |
13148 | |
13149 @example | |
13150 @file{font-lock.c} | |
13151 @end example | |
13152 | |
13153 This file provides C support for syntax highlighting---i.e. | |
13154 highlighting different syntactic constructs of a source file in | |
13155 different colors, for easy reading. The C support is provided so that | |
13156 this is fast. | |
13157 | |
13158 | |
13159 | |
13160 @example | |
13161 @file{dgif_lib.c} | |
13162 @file{gif_err.c} | |
13163 @file{gif_lib.h} | |
13164 @file{gifalloc.c} | |
13165 @end example | |
13166 | |
13167 These modules decode GIF-format image files, for use with glyphs. | |
13168 These files were removed due to Unisys patent infringement concerns. | |
13169 | |
13170 | |
13171 @node Extents, Faces, The Redisplay Mechanism, Top | |
13172 @chapter Extents | |
13173 @cindex extents | |
13174 | |
13175 @menu | |
13176 * Introduction to Extents:: Extents are ranges over text, with properties. | |
13177 * Extent Ordering:: How extents are ordered internally. | |
13178 * Format of the Extent Info:: The extent information in a buffer or string. | |
13179 * Zero-Length Extents:: A weird special case. | |
13180 * Mathematics of Extent Ordering:: A rigorous foundation. | |
13181 * Extent Fragments:: Cached information useful for redisplay. | |
13182 @end menu | |
13183 | |
13184 @node Introduction to Extents, Extent Ordering, Extents, Extents | |
13185 @section Introduction to Extents | |
13186 @cindex extents, introduction to | |
13187 | |
13188 Extents are regions over a buffer, with a start and an end position | |
13189 denoting the region of the buffer included in the extent. In | |
13190 addition, either end can be closed or open, meaning that the endpoint | |
13191 is or is not logically included in the extent. Insertion of a character | |
13192 at a closed endpoint causes the character to go inside the extent; | |
13193 insertion at an open endpoint causes the character to go outside. | |
13194 | |
13195 Extent endpoints are stored using memory indices (see @file{insdel.c}), | |
13196 to minimize the amount of adjusting that needs to be done when | |
13197 characters are inserted or deleted. | |
13198 | |
13199 (Formerly, extent endpoints at the gap could be either before or | |
13200 after the gap, depending on the open/closedness of the endpoint. | |
13201 The intent of this was to make it so that insertions would | |
13202 automatically go inside or out of extents as necessary with no | |
13203 further work needing to be done. It didn't work out that way, | |
13204 however, and just ended up complexifying and buggifying all the | |
13205 rest of the code.) | |
13206 | |
13207 @node Extent Ordering, Format of the Extent Info, Introduction to Extents, Extents | |
13208 @section Extent Ordering | |
13209 @cindex extent ordering | |
13210 | |
13211 Extents are compared using memory indices. There are two orderings | |
13212 for extents and both orders are kept current at all times. The normal | |
13213 or @dfn{display} order is as follows: | |
13214 | |
13215 @example | |
13216 Extent A is ``less than'' extent B, | |
13217 that is, earlier in the display order, | |
13218 if: A-start < B-start, | |
13219 or if: A-start = B-start, and A-end > B-end | |
13220 @end example | |
13221 | |
13222 So if two extents begin at the same position, the larger of them is the | |
13223 earlier one in the display order (@code{EXTENT_LESS} is true). | |
13224 | |
13225 For the e-order, the same thing holds: | |
13226 | |
13227 @example | |
13228 Extent A is ``less than'' extent B in e-order, | |
13229 that is, earlier in the e-order, | |
13230 if: A-end < B-end, | |
13231 or if: A-end = B-end, and A-start > B-start | |
13232 @end example | |
13233 | |
13234 So if two extents end at the same position, the smaller of them is the | |
13235 earlier one in the e-order (@code{EXTENT_E_LESS} is true). | |
13236 | |
13237 The display order and the e-order are complementary orders: any | |
13238 theorem about the display order also applies to the e-order if you swap | |
13239 all occurrences of ``display order'' and ``e-order'', ``less than'' and | |
13240 ``greater than'', and ``extent start'' and ``extent end''. | |
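
The two orderings can be written down as a pair of comparison functions. This is a minimal sketch: @code{struct ext} is a hypothetical stand-in holding only the two memory indices, and @code{ext_less} and @code{ext_e_less} are illustrative names mirroring what the text says @code{EXTENT_LESS} and @code{EXTENT_E_LESS} test, not the actual macros.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an extent: just its two memory indices. */
struct ext {
  long start;
  long end;
};

/* Display order: the earlier start comes first; when the starts are
   equal, the larger extent (the later end) comes first. */
bool ext_less(const struct ext *a, const struct ext *b)
{
  assert(a->start <= a->end && b->start <= b->end);
  return a->start < b->start
      || (a->start == b->start && a->end > b->end);
}

/* E-order: the earlier end comes first; when the ends are equal,
   the smaller extent (the later start) comes first. */
bool ext_e_less(const struct ext *a, const struct ext *b)
{
  assert(a->start <= a->end && b->start <= b->end);
  return a->end < b->end
      || (a->end == b->end && a->start > b->start);
}
```

Swapping @code{start} with @code{end} in one function yields the other, which is the symmetry the preceding paragraph describes.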
13241 | |
13242 @node Format of the Extent Info, Zero-Length Extents, Extent Ordering, Extents | |
13243 @section Format of the Extent Info | |
13244 @cindex extent info, format of the | |
13245 | |
13246 An extent-info structure consists of a list of the buffer or string's | |
13247 extents and a @dfn{stack of extents} that lists all of the extents over | |
13248 a particular position. The stack-of-extents info is used for | |
13249 optimization purposes---it basically caches some info that might | |
13250 be expensive to compute. Certain otherwise hard computations are easy | |
13251 given the stack of extents over a particular position, and if the | |
13252 stack of extents over a nearby position is known (because it was | |
13253 calculated at some prior point in time), it's easy to move the stack | |
13254 of extents to the proper position. | |
13255 | |
13256 Given that the stack of extents is an optimization, and given that | |
13257 it requires memory, a string's stack of extents is wiped out each | |
13258 time a garbage collection occurs. Therefore, any time you retrieve | |
13259 the stack of extents, it might not be there. If you need it to | |
13260 be there, use the @code{_force} version. | |
13261 | |
13262 Similarly, a string may or may not have an extent_info structure. | |
13263 (Generally it won't if there haven't been any extents added to the | |
13264 string.) So use the @code{_force} version if you need the extent_info | |
13265 structure to be there. | |
13266 | |
13267 A list of extents is maintained as a double gap array. One gap array | |
13268 is ordered by start index (the @dfn{display order}) and the other is | |
13269 ordered by end index (the @dfn{e-order}). Note that positions in an | |
13270 extent list should logically be conceived of as referring @emph{to} a | |
13271 particular extent (as is the norm in programs) rather than sitting | |
13272 between two extents. Note also that callers of these functions should | |
13273 not be aware of the fact that the extent list is implemented as an | |
13274 array, except for the fact that positions are integers (this should be | |
13275 generalized to handle integers and linked list equally well). | |
13276 | |
13277 A gap array is the same structure used by buffer text: an array of | |
13278 elements with a ``gap'' somewhere in the middle. Insertion and deletion | |
13279 happen by moving the gap to the insertion/deletion point, and then | |
13280 expanding/contracting as necessary. Gap arrays have a number of | |
13281 useful properties: | |
13282 | |
13283 @enumerate | |
13284 @item | |
13285 They are space efficient, as there is no need for next/previous pointers. | |
13286 | |
13287 @item | |
13288 If the items in them are sorted, locating an item is fast -- @math{O(log N)}. | |
13289 | |
13290 @item | |
13291 Insertion and deletion are very fast (constant time, essentially) if the | |
13292 gap is near (which favors localized operations, as will usually be the | |
13293 case). Even if not, it requires only a block move of memory, which is | |
13294 generally a highly optimized operation on modern processors. | |
13295 | |
13296 @item | |
13297 Code to manipulate them is relatively simple to write. | |
13298 @end enumerate | |
13299 | |
13300 An alternative would be balanced binary trees, which have guaranteed | |
13301 @math{O(log N)} time for all operations (although the constant factors | |
13302 are not as good, and repeated localized operations will be slower than | |
13303 for a gap array). Such code is quite tricky to write, however. | |
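
To make the gap-array mechanics concrete, here is a minimal sketch in C that stores plain @code{int} elements. All names (@code{struct gap_array}, @code{ga_insert}, and so on) and the doubling growth policy are assumptions chosen for the example; the real extent lists hold extent objects and differ in detail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical gap array of ints.  Storage layout:
   [0, gap) live elements, [gap, gap + gap_size) unused hole,
   [gap + gap_size, len + gap_size) live elements. */
struct gap_array {
  int *data;
  size_t len;        /* number of live elements */
  size_t gap;        /* logical index at which the gap sits */
  size_t gap_size;   /* number of unused slots in the gap */
};

void ga_init(struct gap_array *ga)
{
  ga->data = NULL;
  ga->len = ga->gap = ga->gap_size = 0;
}

void ga_free(struct gap_array *ga)
{
  free(ga->data);
  ga_init(ga);
}

/* Move the gap so that it starts at logical index POS.  The cost is
   one block move proportional to the distance the gap travels. */
void ga_move_gap(struct gap_array *ga, size_t pos)
{
  assert(pos <= ga->len);
  if (pos < ga->gap)
    memmove(ga->data + pos + ga->gap_size, ga->data + pos,
            (ga->gap - pos) * sizeof *ga->data);
  else if (pos > ga->gap)
    memmove(ga->data + ga->gap, ga->data + ga->gap + ga->gap_size,
            (pos - ga->gap) * sizeof *ga->data);
  ga->gap = pos;
}

/* Insert VALUE before logical index POS.  Returns 0, or -1 if out of memory. */
int ga_insert(struct gap_array *ga, size_t pos, int value)
{
  assert(pos <= ga->len);
  if (ga->gap_size == 0) {
    /* No hole left: storage holds exactly len elements, so grow it
       and place the new hole at the end. */
    size_t new_cap = ga->len ? ga->len * 2 : 8;
    int *p = realloc(ga->data, new_cap * sizeof *p);
    if (p == NULL)
      return -1;
    ga->data = p;
    ga->gap = ga->len;
    ga->gap_size = new_cap - ga->len;
  }
  ga_move_gap(ga, pos);
  ga->data[ga->gap++] = value;
  ga->gap_size--;
  ga->len++;
  return 0;
}

/* Delete the element at logical index POS by letting the gap swallow it. */
void ga_delete(struct gap_array *ga, size_t pos)
{
  assert(pos < ga->len);
  ga_move_gap(ga, pos);
  ga->gap_size++;
  ga->len--;
}

/* Fetch the element at logical index POS, skipping over the gap. */
int ga_get(const struct gap_array *ga, size_t pos)
{
  assert(pos < ga->len);
  return ga->data[pos < ga->gap ? pos : pos + ga->gap_size];
}
```

Repeated insertions at the same place never move the gap, which is why localized edits are essentially constant time.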
13304 | |
13305 @node Zero-Length Extents, Mathematics of Extent Ordering, Format of the Extent Info, Extents | |
13306 @section Zero-Length Extents | |
13307 @cindex zero-length extents | |
13308 @cindex extents, zero-length | |
13309 | |
13310 Extents can be zero-length, and will end up that way if their endpoints | |
13311 are explicitly set that way or if their detachable property is @code{nil} | |
13312 and all the text in the extent is deleted. (The exception is open-open | |
13313 zero-length extents, which are barred from existing because there is | |
13314 no sensible way to define their properties. Deletion of the text in | |
13315 an open-open extent causes it to be converted into a closed-open | |
13316 extent.) Zero-length extents are primarily used to represent | |
13317 annotations, and behave as follows: | |
13318 | |
13319 @enumerate | |
13320 @item | |
13321 Insertion at the position of a zero-length extent expands the extent | |
13322 if both endpoints are closed; goes after the extent if it is closed-open; | |
13323 and goes before the extent if it is open-closed. | |
13324 | |
13325 @item | |
13326 Deletion of a character on a side of a zero-length extent whose | |
13327 corresponding endpoint is closed causes the extent to be detached if | |
13328 it is detachable; if the extent is not detachable or the corresponding | |
13329 endpoint is open, the extent remains in the buffer, moving as necessary. | |
13330 @end enumerate | |
13331 | |
13332 Note that closed-open, non-detachable zero-length extents behave | |
13333 exactly like markers and that open-closed, non-detachable zero-length | |
13334 extents behave like the ``point-type'' marker in Mule. | |
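
The insertion rule for zero-length extents can be summarized as a small decision function. The enumeration and the name @code{zero_length_insert} are hypothetical and exist only to restate rule (1) above in code form.

```c
#include <assert.h>
#include <stdbool.h>

/* What happens to text inserted at the position of a zero-length extent. */
enum zl_insert {
  ZL_EXPANDS_EXTENT,      /* closed-closed: the text ends up inside */
  ZL_TEXT_AFTER_EXTENT,   /* closed-open: the text goes after the extent */
  ZL_TEXT_BEFORE_EXTENT   /* open-closed: the text goes before the extent */
};

enum zl_insert zero_length_insert(bool start_open, bool end_open)
{
  /* Open-open zero-length extents are barred from existing. */
  assert(!(start_open && end_open));
  if (!start_open && !end_open)
    return ZL_EXPANDS_EXTENT;
  if (!start_open && end_open)
    return ZL_TEXT_AFTER_EXTENT;
  return ZL_TEXT_BEFORE_EXTENT;
}
```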
13335 | |
13336 @node Mathematics of Extent Ordering, Extent Fragments, Zero-Length Extents, Extents | |
13337 @section Mathematics of Extent Ordering | |
13338 @cindex mathematics of extent ordering | |
13339 @cindex extent mathematics | |
13340 @cindex extent ordering | |
13341 | |
13342 @cindex display order of extents | |
13343 @cindex extents, display order | |
13344 The extents in a buffer are ordered by ``display order'' because that | |
13345 is the order that the redisplay mechanism needs to process them in. | |
13346 The e-order is an auxiliary ordering used to facilitate operations | |
13347 over extents. The operations that can be performed on the ordered | |
13348 list of extents in a buffer are | |
13349 | |
13350 @enumerate | |
13351 @item | |
13352 Locate where an extent would go if inserted into the list. | |
13353 @item | |
13354 Insert an extent into the list. | |
13355 @item | |
13356 Remove an extent from the list. | |
13357 @item | |
13358 Map over all the extents that overlap a range. | |
13359 @end enumerate | |
13360 | |
13361 (4) requires being able to determine the first and last extents | |
13362 that overlap a range. | |
13363 | |
13364 NOTE: @dfn{overlap} is used as follows: | |
13365 | |
13366 @itemize @bullet | |
13367 @item | |
13368 two ranges overlap if they have at least one point in common. | |
13369 Whether the endpoints are open or closed makes a difference here. | |
13370 @item | |
13371 a point overlaps a range if the point is contained within the | |
13372 range; this is equivalent to treating a point @math{P} as the range | |
13373 @math{[P, P]}. | |
13374 @item | |
13375 In the case of an @emph{extent} overlapping a point or range, the extent | |
13376 is normally treated as having closed endpoints. This applies | |
13377 consistently in the discussion of stacks of extents and such below. | |
13378 Note that this definition of overlap is not necessarily consistent with | |
13379 the extents that @code{map-extents} maps over, since @code{map-extents} | |
13380 sometimes pays attention to whether the endpoints of an extent are open | |
13381 or closed. But for our purposes, it greatly simplifies things to treat | |
13382 all extents as having closed endpoints. | |
13383 @end itemize | |
13384 | |
13385 First, define @math{>}, @math{<}, @math{<=}, etc. as applied to extents | |
13386 to mean comparison according to the display order. Comparison between | |
13387 an extent @math{E} and an index @math{I} means comparison between | |
13388 @math{E} and the range @math{[I, I]}. | |
13389 | |
13390 Also define @math{e>}, @math{e<}, @math{e<=}, etc. to mean comparison | |
13391 according to the e-order. | |
13392 | |
13393 For any range @math{R}, define @math{R(0)} to be the starting index of | |
13394 the range and @math{R(1)} to be the ending index of the range. | |
13395 | |
13396 For any extent @math{E}, define @math{E(next)} to be the extent directly | |
13397 following @math{E}, and @math{E(prev)} to be the extent directly | |
13398 preceding @math{E}. Assume @math{E(next)} and @math{E(prev)} can be | |
13399 determined from @math{E} in constant time. (This is because we store | |
13400 the extent list as a doubly linked list.) | |
13401 | |
13402 Similarly, define @math{E(e-next)} and @math{E(e-prev)} to be the | |
13403 extents directly following and preceding @math{E} in the e-order. | |
13404 | |
13405 Now: | |
13406 | |
13407 Let @math{R} be a range. | |
13408 Let @math{F} be the first extent overlapping @math{R}. | |
13409 Let @math{L} be the last extent overlapping @math{R}. | |
13410 | |
13411 Theorem 1: @math{R(1)} lies between @math{L} and @math{L(next)}, | |
13412 i.e. @math{L <= R(1) < L(next)}. | |
13413 | |
13414 This follows easily from the definition of display order. The | |
13415 basic reason that this theorem applies is that the display order | |
13416 sorts by increasing starting index. | |
13417 | |
13418 Therefore, we can determine @math{L} just by looking at where we would | |
13419 insert @math{R(1)} into the list, and if we know @math{F} and are moving | |
13420 forward over extents, we can easily determine when we've hit @math{L} by | |
13421 comparing the extent we're at to @math{R(1)}. | |
13422 | |
13423 @example | |
13424 Theorem 2: @math{F(e-prev) e< [1, R(0)] e<= F}. | |
13425 @end example | |
13426 | |
13427 This is the analog of Theorem 1, and applies because the e-order | |
13428 sorts by increasing ending index. | |
13429 | |
13430 Therefore, @math{F} can be found in the same amount of time as | |
13431 operation (1), i.e. the time that it takes to locate where an extent | |
13432 would go if inserted into the e-order list. This is @math{O(log N)}, | |
13433 since we are using gap arrays to manage extents. | |
13434 | |
13435 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents | |
13436 (ordered in display order and e-order, just like for normal extent | |
13437 lists) that overlap an index @math{I}. | |
13438 | |
13439 Now: | |
13440 | |
13441 Let @math{I} be an index, let @math{S} be the stack of extents on | |
13442 @math{I} and let @math{F} be the first extent in @math{S}. | |
13443 | |
13444 Theorem 3: The first extent in @math{S} is the first extent that overlaps | |
13445 any range @math{[I, J]}. | |
13446 | |
13447 Proof: Any extent that overlaps @math{[I, J]} but does not include | |
13448 @math{I} must have a start index @math{> I}, and thus be greater than | |
13449 any extent in @math{S}. | |
13450 | |
13451 Therefore, finding the first extent that overlaps a range @math{R} is | |
13452 the same as finding the first extent that overlaps @math{R(0)}. | |
13453 | |
13454 Theorem 4: Let @math{I2} be an index such that @math{I2 > I}, and let | |
13455 @math{F2} be the first extent that overlaps @math{I2}. Then, either | |
13456 @math{F2} is in @math{S} or @math{F2} is greater than any extent in | |
13457 @math{S}. | |
13458 | |
13459 Proof: If @math{F2} does not include @math{I} then its start index is | |
13460 greater than @math{I} and thus it is greater than any extent in | |
13461 @math{S}, including @math{F}. Otherwise, @math{F2} includes @math{I} | |
13462 and thus is in @math{S}, and thus @math{F2 >= F}. | |
13463 | |
13464 @node Extent Fragments, , Mathematics of Extent Ordering, Extents | |
13465 @section Extent Fragments | |
13466 @cindex extent fragments | |
13467 @cindex fragments, extent | |
13468 | |
13469 Imagine that the buffer is divided up into contiguous, non-overlapping | |
13470 @dfn{runs} of text such that no extent starts or ends within a run | |
13471 (extents that abut the run don't count). | |
13472 | |
13473 An extent fragment is a structure that holds data about the run that | |
13474 contains a particular buffer position (if the buffer position is at the | |
13475 junction of two runs, the run after the position is used)---the | |
13476 beginning and end of the run, a list of all of the extents in that run, | |
13477 the @dfn{merged face} that results from merging all of the faces | |
13478 corresponding to those extents, the begin and end glyphs at the | |
13479 beginning of the run, etc. This is the information that redisplay needs | |
13480 in order to display this run. | |
13481 | |
13482 Extent fragments have to be very quick to update to a new buffer | |
13483 position when moving linearly through the buffer. They rely on the | |
13484 stack-of-extents code, which does the heavy-duty algorithmic work of | |
13485 determining which extents overlie a particular position. | |
13486 | |
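
The notion of a run can be illustrated with a deliberately naive sketch that scans every extent endpoint to find the run containing a position. The names @code{struct span} and @code{find_run} are invented for this example, and the linear scan is an assumption made for clarity; it is not how the stack-of-extents code works, which exists precisely to avoid such a scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent: just its two endpoints. */
struct span {
  long start;
  long end;
};

/* Find the run [*run_start, *run_end) that contains POS, where runs
   are delimited by extent endpoints and by the buffer boundaries.
   An endpoint exactly at POS begins the run, so a position at the
   junction of two runs is assigned to the run after it. */
void find_run(const struct span *exts, size_t n,
              long buf_start, long buf_end, long pos,
              long *run_start, long *run_end)
{
  long lo = buf_start, hi = buf_end;
  size_t i;

  assert(buf_start <= pos && pos <= buf_end);
  for (i = 0; i < n; i++) {
    long ends[2];
    int k;
    ends[0] = exts[i].start;
    ends[1] = exts[i].end;
    for (k = 0; k < 2; k++) {
      if (ends[k] <= pos && ends[k] > lo)
        lo = ends[k];   /* closest endpoint at or before POS */
      if (ends[k] > pos && ends[k] < hi)
        hi = ends[k];   /* closest endpoint after POS */
    }
  }
  *run_start = lo;
  *run_end = hi;
}
```

Within the returned run the set of overlying extents is constant, so the merged face and glyph information need only be computed once per run.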
13487 @node Faces, Glyphs, Extents, Top | |
13488 @chapter Faces | |
13489 @cindex faces | |
13490 | |
13491 Not yet documented. | |
13492 | |
13493 @node Glyphs, Specifiers, Faces, Top | |
13494 @chapter Glyphs | |
13495 @cindex glyphs | |
13496 | |
13497 Glyphs are graphical elements that can be displayed in XEmacs buffers or | |
13498 gutters. We use the term graphical element here in the broadest possible | |
13499 sense since glyphs can be as mundane as text or as arcane as a native | |
13500 tab widget. | |
13501 | |
13502 In XEmacs, glyphs represent the uninstantiated state of graphical | |
13503 elements, i.e. they hold all the information necessary to produce an | |
13504 image on-screen but the image need not exist at this stage, and multiple | |
13505 screen images can be instantiated from a single glyph. | |
13506 | |
13507 @c #### find a place for this discussion | |
13508 @c The decision to make image specifiers a separate type is debatable. | |
13509 @c In fact, the design decision to create a separate image specifier | |
13510 @c type, rather than make glyphs themselves be specifiers, is | |
13511 @c debatable---the other properties of glyphs are rarely used and could | |
13512 @c conceivably have been incorporated into the glyph's instantiator. | |
13513 @c The rarely used glyph types (buffer, pointer, icon) could also have | |
13514 @c been incorporated into the instantiator. | |
13515 | |
13516 Glyphs are lazily instantiated by calling one of the glyph | |
13517 functions. This usually occurs within redisplay when | |
13518 @code{Fglyph_height} is called. Instantiation causes an image-instance | |
13519 to be created and cached. This cache is on a per-device basis for all glyphs | |
13520 except widget-glyphs, and on a per-window basis for widget-glyphs. The | |
13521 caching is done by @code{image_instantiate} and is necessary because it | |
13522 is generally possible to display an image-instance in multiple | |
13523 domains. For instance if we create a Pixmap, we can actually display | |
13524 this on multiple windows - even though we only need a single Pixmap | |
13525 instance to do this. If caching wasn't done then it would be necessary | |
13526 to create image-instances for every displayable occurrence of a glyph - | |
13527 and every usage - and this would be extremely memory and cpu intensive. | |
13528 | |
13529 Widget-glyphs (a.k.a. native widgets) are not cached in this way. This is | |
13530 because widget-glyph image-instances on screen are toolkit windows, and | |
13531 thus cannot be reused in multiple XEmacs domains. Thus widget-glyphs are | |
13532 cached on an XEmacs window basis. | |
13533 | |
13534 Any action on a glyph first consults the cache before actually | |
13535 instantiating a widget. | |
13536 | |
13537 @section Glyph Instantiation | |
13538 @cindex glyph instantiation | |
13539 @cindex instantiation, glyph | |
13540 | |
13541 Glyph instantiation is a hairy topic and requires some explanation. The | |
13542 guts of glyph instantiation are contained within | |
13543 @code{image_instantiate}. A glyph contains an image which is a | |
13544 specifier. When a glyph function - for instance @code{Fglyph_height} - | |
13545 asks for a property of the glyph that can only be determined from its | |
13546 instantiated state, then the glyph image is instantiated and an image | |
13547 instance created. The instantiation process is governed by the specifier | |
13548 code and goes through a series of steps: | |
13549 | |
13550 @itemize @bullet | |
13551 @item | |
13552 Validation. Instantiation of image instances happens dynamically - often | |
13553 within the guts of redisplay. Thus it is often not feasible to catch | |
13554 instantiator errors at instantiation time. Instead the instantiator is | |
13555 validated at the time it is added to the image specifier. This step | |
13556 is implemented by @code{image_validate}, which at a simple level validates | |
13557 keyword-value pairs. | |
13558 @item | |
13559 Duplication. The specifier code by default takes a copy of the | |
13560 instantiator. This is reasonable for most specifiers but in the case of | |
13561 widget-glyphs can be problematic, since some of the properties in the | |
13562 instantiator - for instance callbacks - could cause infinite recursion | |
13563 in the copying process. Thus the image code defines a function - | |
13564 @code{image_copy_instantiator} - which will selectively copy values. | |
13565 This is controlled by the way that a keyword is defined either using | |
13566 @code{IIFORMAT_VALID_KEYWORD} or | |
13567 @code{IIFORMAT_VALID_NONCOPY_KEYWORD}. Note that the image caching and | |
13568 redisplay code relies on instantiator copying to ensure that current and | |
13569 new instantiators are actually different rather than referring to the | |
13570 same thing. | |
13571 @item | |
13572 Normalization. Once the instantiator has been copied it must be | |
13573 converted into a form that is viable at instantiation time. This can | |
13574 involve no changes at all, but typically involves things like converting | |
13575 file names to the actual data. This step is implemented by | |
13576 @code{image_going_to_add} and @code{normalize_image_instantiator}. | |
13577 @item | |
13578 Instantiation. When an image instance is actually required for display | |
13579 it is instantiated using @code{image_instantiate}. This involves calling | |
13580 instantiate methods that are specific to the type of image being | |
13581 instantiated. | |
13582 @end itemize | |
13583 | |
13584 The final instantiation phase also involves a number of steps. In order | |
13585 to understand these we need to describe a number of concepts. | |
13586 | |
13587 An image is instantiated in a @dfn{domain}, where a domain can be any | |
13588 one of a device, frame, window or image-instance. The domain gives the | |
13589 image-instance context and identity, and properties that affect the | |
13590 appearance of the image-instance may be different for the same glyph | |
13591 instantiated in different domains. An example is the face used to | |
13592 display the image-instance. | |
13593 | |
13594 Although an image is instantiated in a particular domain the | |
13595 instantiation domain is not necessarily the domain in which the | |
13596 image-instance is cached. For example a pixmap can be instantiated in a | |
13597 window but actually be cached on a per-device basis. The domain in which | |
13598 the image-instance is actually cached is called the | |
13599 @dfn{governing-domain}. A governing-domain is currently either a device | |
13600 or a window. Widget-glyphs and text-glyphs have a window as a | |
13601 governing-domain, all other image-instances have a device as the | |
13602 governing-domain. The governing domain for an image-instance is | |
13603 determined using the governing_domain image-instance method. | |
13604 | |
13605 @section Widget-Glyphs | |
13606 @cindex widget-glyphs | |
13607 | |
13608 @section Widget-Glyphs in the MS-Windows Environment | |
13609 @cindex widget-glyphs in the MS-Windows environment | |
13610 @cindex MS-Windows environment, widget-glyphs in the | |
13611 | |
13612 To Do | |
13613 | |
13614 @section Widget-Glyphs in the X Environment | |
13615 @cindex widget-glyphs in the X environment | |
13616 @cindex X environment, widget-glyphs in the | |
13617 | |
13618 Widget-glyphs under X make heavy use of lwlib (@pxref{Lucid Widget | |
13619 Library}) for manipulating the native toolkit objects. This is primarily | |
13620 so that different toolkits can be supported for widget-glyphs, just as | |
13621 they are supported for features such as menubars. | |
13622 | |
13623 Lwlib is extremely poorly documented and quite hairy so here is my | |
13624 understanding of what goes on. | |
13625 | |
13626 Lwlib maintains a set of widget_instances which mirror the hierarchical | |
13627 state of Xt widgets. I think this is so that widgets can be updated and | |
13628 manipulated generically by the lwlib library. For instance | |
13629 update_one_widget_instance can cope with multiple types of widget and | |
13630 multiple types of toolkit. Each element in the widget hierarchy is updated | |
13631 from its corresponding widget_instance by walking the widget_instance | |
13632 tree recursively. | |
13633 | |
13634 This has desirable properties such as lw_modify_all_widgets which is | |
13635 called from @file{glyphs-x.c} and updates all the properties of a widget | |
13636 without having to know what the widget is or what toolkit it is from. | |
13637 Unfortunately this also has hairy properties such as making the lwlib | |
13638 code quite complex. And of course lwlib has to know at some level what | |
13639 the widget is and how to set its properties. | |
13640 | |
13641 @node Specifiers, Menus, Glyphs, Top | |
13642 @chapter Specifiers | |
13643 @cindex specifiers | |
13644 | |
13645 Not yet documented. | |
13646 | |
13647 Specifiers are documented in depth in the Lisp Reference manual. | |
13648 @xref{Specifiers,,, lispref, XEmacs Lisp Reference Manual}. The code in | |
13649 @file{specifier.c} is pretty straightforward. | |
13650 | |
13651 @node Menus, Events and the Event Loop, Specifiers, Top | |
13652 @chapter Menus | |
13653 @cindex menus | |
13654 | |
13655 A menu is set by setting the value of the variable | |
13656 @code{current-menubar} (which may be buffer-local) and then calling | |
13657 @code{set-menubar-dirty-flag} to signal a change. This will cause the | |
13658 menu to be redrawn at the next redisplay. The format of the data in | |
13659 @code{current-menubar} is described in @file{menubar.c}. | |
13660 | |
13661 Internally the data in current-menubar is parsed into a tree of | |
13662 @code{widget_value's} (defined in @file{lwlib.h}); this is accomplished | |
13663 by the recursive function @code{menu_item_descriptor_to_widget_value()}, | |
13664 called by @code{compute_menubar_data()}. Such a tree is deallocated | |
13665 using @code{free_widget_value()}. | |
13666 | |
13667 @code{update_screen_menubars()} is one of the external entry points. | |
13668 This checks to see, for each screen, if that screen's menubar needs to | |
13669 be updated. This is the case if | |
13670 | |
13671 @enumerate | |
13672 @item | |
13673 @code{set-menubar-dirty-flag} was called since the last redisplay. (This | |
13674 function sets the C variable menubar_has_changed.) | |
13675 @item | |
13676 The buffer displayed in the screen has changed. | |
13677 @item | |
13678 The screen has no menubar currently displayed. | |
13679 @end enumerate | |
13680 | |
13681 @code{set_screen_menubar()} is called for each such screen. This | |
13682 function calls @code{compute_menubar_data()} to create the tree of | |
13683 widget_value's, then calls @code{lw_create_widget()}, | |
13684 @code{lw_modify_all_widgets()}, and/or @code{lw_destroy_all_widgets()} | |
13685 to create the X-Toolkit widget associated with the menu. | |
13686 | |
13687 @code{update_psheets()}, the other external entry point, actually | |
13688 changes the menus being displayed. It uses the widgets fixed by | |
13689 @code{update_screen_menubars()} and calls various X functions to ensure | |
13690 that the menus are displayed properly. | |
13691 | |
13692 The menubar widget is set up so that @code{pre_activate_callback()} is | |
13693 called when the menu is first selected (i.e. mouse button goes down), | |
13694 and @code{menubar_selection_callback()} is called when an item is | |
13695 selected. @code{pre_activate_callback()} calls the function in | |
13696 activate-menubar-hook, which can change the menubar (this is described | |
13697 in @file{menubar.c}). If the menubar is changed, | |
13698 @code{set_screen_menubar()} is called. | |
13699 @code{menubar_selection_callback()} enqueues a menu event, putting in it | |
13700 a function to call (either @code{eval} or @code{call-interactively}) and | |
13701 its argument, which is the callback function or form given in the menu's | |
13702 description. | |
13703 | |
13704 @node Events and the Event Loop, Asynchronous Events; Quit Checking, Menus, Top | |
7475 @chapter Events and the Event Loop | 13705 @chapter Events and the Event Loop |
7476 @cindex events and the event loop | 13706 @cindex events and the event loop |
7477 @cindex event loop, events and the | 13707 @cindex event loop, events and the |
7478 | 13708 |
7479 @menu | 13709 @menu |
8284 the only code remaining is code to call out to Lisp or provide simple | 14514 the only code remaining is code to call out to Lisp or provide simple |
8285 bootstrapping implementations early in temacs, before the echo-area Lisp | 14515 bootstrapping implementations early in temacs, before the echo-area Lisp |
8286 code is loaded). | 14516 code is loaded). |
8287 | 14517 |
8288 | 14518 |
8289 @node Asynchronous Events; Quit Checking, Evaluation; Stack Frames; Bindings, Events and the Event Loop, Top | 14519 @node Asynchronous Events; Quit Checking, Lstreams, Events and the Event Loop, Top |
8290 @chapter Asynchronous Events; Quit Checking | 14520 @chapter Asynchronous Events; Quit Checking |
8291 @cindex asynchronous events; quit checking | 14521 @cindex asynchronous events; quit checking |
8292 @cindex asynchronous events | 14522 @cindex asynchronous events |
8293 | 14523 |
8294 @menu | 14524 @menu |
8610 @item | 14840 @item |
8611 printing code does not do code conversion or gettext when | 14841 printing code does not do code conversion or gettext when |
8612 printing to stdout/stderr. | 14842 printing to stdout/stderr. |
8613 @end itemize | 14843 @end itemize |
8614 | 14844 |
8615 @node Evaluation; Stack Frames; Bindings, Symbols and Variables, Asynchronous Events; Quit Checking, Top | 14845 @node Lstreams, Subprocesses, Asynchronous Events; Quit Checking, Top |
8616 @chapter Evaluation; Stack Frames; Bindings | |
8617 @cindex evaluation; stack frames; bindings | |
8618 @cindex stack frames; bindings, evaluation; | |
8619 @cindex bindings, evaluation; stack frames; | |
8620 | |
8621 @menu | |
8622 * Evaluation:: | |
8623 * Dynamic Binding; The specbinding Stack; Unwind-Protects:: | |
8624 * Simple Special Forms:: | |
8625 * Catch and Throw:: | |
8626 @end menu | |
8627 | |
8628 @node Evaluation, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings, Evaluation; Stack Frames; Bindings | |
8629 @section Evaluation | |
8630 @cindex evaluation | |
8631 | |
8632 @code{Feval()} evaluates the form (a Lisp object) that is passed to | |
8633 it. Note that evaluation is only non-trivial for two types of objects: | |
8634 symbols and conses. A symbol is evaluated simply by calling | |
8635 @code{symbol-value} on it and returning the value. | |
8636 | |
8637 Evaluating a cons means calling a function. First, @code{eval} checks | |
8638 to see if garbage-collection is necessary, and calls | |
8639 @code{garbage_collect_1()} if so. It then increases the evaluation | |
8640 depth by 1 (@code{lisp_eval_depth}, which is always less than | |
8641 @code{max_lisp_eval_depth}) and adds an element to the linked list of | |
8642 @code{struct backtrace}'s (@code{backtrace_list}). Each such structure | |
8643 contains a pointer to the function being called plus a list of the | |
8644 function's arguments. Originally these values are stored unevalled, and | |
8645 as they are evaluated, the backtrace structure is updated. Garbage | |
8646 collection pays attention to the objects pointed to in the backtrace | |
8647 structures (garbage collection might happen while a function is being | |
8648 called or while an argument is being evaluated, and there could easily | |
8649 be no other references to the arguments in the argument list; once an | |
8650 argument is evaluated, however, the unevalled version is not needed by | |
8651 eval, and so the backtrace structure is changed). | |
8652 | |
8653 At this point, the function to be called is determined by looking at | |
8654 the car of the cons (if this is a symbol, its function definition is | |
8655 retrieved and the process repeated). The function should then consist | |
8656 of either a @code{Lisp_Subr} (built-in function written in C), a | |
8657 @code{Lisp_Compiled_Function} object, or a cons whose car is one of the | |
8658 symbols @code{autoload}, @code{macro} or @code{lambda}. | |
8659 | |
8660 If the function is a @code{Lisp_Subr}, the lisp object points to a | |
8661 @code{struct Lisp_Subr} (created by @code{DEFUN()}), which contains a | |
8662 pointer to the C function, a minimum and maximum number of arguments | |
8663 (or possibly the special constants @code{MANY} or @code{UNEVALLED}), a | |
8664 pointer to the symbol referring to that subr, and a couple of other | |
8665 things. If the subr wants its arguments @code{UNEVALLED}, they are | |
8666 passed raw as a list. Otherwise, an array of evaluated arguments is | |
8667 created and put into the backtrace structure, and either passed whole | |
8668 (@code{MANY}) or each argument is passed as a C argument. | |
8669 | |
8670 If the function is a @code{Lisp_Compiled_Function}, | |
8671 @code{funcall_compiled_function()} is called. If the function is a | |
8672 lambda list, @code{funcall_lambda()} is called. If the function is a | |
8673 macro, [..... fill in] is done. If the function is an autoload, | |
8674 @code{do_autoload()} is called to load the definition and then eval | |
8675 starts over [explain this more]. | |
8676 | |
8677 When @code{Feval()} exits, the evaluation depth is reduced by one, the | |
8678 debugger is called if appropriate, and the current backtrace structure | |
8679 is removed from the list. | |
8680 | |
8681 Both @code{funcall_compiled_function()} and @code{funcall_lambda()} need | |
8682 to go through the list of formal parameters to the function and bind | |
8683 them to the actual arguments, checking for @code{&rest} and | |
8684 @code{&optional} symbols in the formal parameters and making sure the | |
8685 number of actual arguments is correct. | |
8686 @code{funcall_compiled_function()} can do this a little more | |
8687 efficiently, since the formal parameter list can be checked for sanity | |
8688 when the compiled function object is created. | |
8689 | |
8690 @code{funcall_lambda()} simply calls @code{Fprogn} to execute the code | |
8691 in the lambda list. | |
8692 | |
8693 @code{funcall_compiled_function()} calls the real byte-code interpreter | |
8694 @code{execute_optimized_program()} on the byte-code instructions, which | |
8695 are converted into an internal form for faster execution. | |
8696 | |
8697 When a compiled function is executed for the first time by | |
8698 @code{funcall_compiled_function()}, or during the dump phase of building | |
8699 XEmacs, the byte-code instructions are converted from a | |
8700 @code{Lisp_String} (which is inefficient to access, especially in the | |
8701 presence of MULE) into a @code{Lisp_Opaque} object containing an array | |
8702 of unsigned char, which can be directly executed by the byte-code | |
8703 interpreter. At this time the byte code is also analyzed for validity | |
8704 and transformed into a more optimized form, so that | |
8705 @code{execute_optimized_program()} can really fly. | |
8706 | |
8707 Here are some of the optimizations performed by the internal byte-code | |
8708 transformer: | |
8709 @enumerate | |
8710 @item | |
8711 References to the @code{constants} array are checked for out-of-range | |
8712 indices, so that the byte interpreter doesn't have to. | |
8713 @item | |
8714 References to the @code{constants} array that will be used as a Lisp | |
8715 variable are checked for being correct non-constant (i.e. not @code{t}, | |
8716 @code{nil}, or @code{keywordp}) symbols, so that the byte interpreter | |
8717 doesn't have to. | |
8718 @item | |
8719 The maximum number of variable bindings in the byte-code is | |
8720 pre-computed, so that space on the @code{specpdl} stack can be | |
8721 pre-reserved once for the whole function execution. | |
8722 @item | |
8723 All byte-code jumps are relative to the current program counter instead | |
8724 of the start of the program, thereby saving a register. | |
8725 @item | |
8726 One-byte relative jumps are converted from the byte-code form of unsigned | |
8727 chars offset by 127 to machine-friendly signed chars. | |
8728 @end enumerate | |
8729 | |
8730 Of course, this transformation of the @code{instructions} should not be | |
8731 visible to the user, so @code{Fcompiled_function_instructions()} needs | |
8732 to know how to convert the optimized opaque object back into a Lisp | |
8733 string that is identical to the original string from the @file{.elc} | |
8734 file. (Actually, the resulting string may (rarely) contain slightly | |
8735 different, yet equivalent, byte code.) | |
8736 | |
8737 @code{Ffuncall()} implements Lisp @code{funcall}. @code{(funcall fun | |
8738 x1 x2 x3 ...)} is equivalent to @code{(eval (list fun (quote x1) (quote | |
8739 x2) (quote x3) ...))}. @code{Ffuncall()} contains its own code to do | |
8740 the evaluation, however, and is very similar to @code{Feval()}. | |
8741 | |
8742 From the performance point of view, it is worth knowing that most of the | |
8743 time in Lisp evaluation is spent executing @code{Lisp_Subr} and | |
8744 @code{Lisp_Compiled_Function} objects via @code{Ffuncall()} (not | |
8745 @code{Feval()}). | |
8746 | |
8747 @code{Fapply()} implements Lisp @code{apply}, which is very similar to | |
8748 @code{funcall} except that if the last argument is a list, the result is the | |
8749 same as if each of the arguments in the list had been passed separately. | |
8750 @code{Fapply()} does some business to expand the last argument if it's a | |
8751 list, then calls @code{Ffuncall()} to do the work. | |
8752 | |
8753 @code{apply1()}, @code{call0()}, @code{call1()}, @code{call2()}, and | |
8754 @code{call3()} call a function, passing it the argument(s) given (the | |
8755 arguments are given as separate C arguments rather than being passed as | |
8756 an array). @code{apply1()} uses @code{Fapply()} while the others use | |
8757 @code{Ffuncall()} to do the real work. | |
8758 | |
8759 @node Dynamic Binding; The specbinding Stack; Unwind-Protects, Simple Special Forms, Evaluation, Evaluation; Stack Frames; Bindings | |
8760 @section Dynamic Binding; The specbinding Stack; Unwind-Protects | |
8761 @cindex dynamic binding; the specbinding stack; unwind-protects | |
8762 @cindex binding; the specbinding stack; unwind-protects, dynamic | |
8763 @cindex specbinding stack; unwind-protects, dynamic binding; the | |
8764 @cindex unwind-protects, dynamic binding; the specbinding stack; | |
8765 | |
8766 @example | |
8767 struct specbinding | |
8768 @{ | |
8769 Lisp_Object symbol; | |
8770 Lisp_Object old_value; | |
8771 Lisp_Object (*func) (Lisp_Object); /* for unwind-protect */ | |
8772 @}; | |
8773 @end example | |
8774 | |
8775 @code{struct specbinding} is used for local-variable bindings and | |
8776 unwind-protects. @code{specpdl} holds an array of @code{struct specbinding}'s, | |
8777 @code{specpdl_ptr} points to the beginning of the free bindings in the | |
8778 array, @code{specpdl_size} specifies the total number of binding slots | |
8779 in the array, and @code{max_specpdl_size} specifies the maximum number | |
8780 of bindings the array can be expanded to hold. @code{grow_specpdl()} | |
8781 increases the size of the @code{specpdl} array, multiplying its size by | |
8782 2 but never exceeding @code{max_specpdl_size} (except that if this | |
8783 number is less than 400, it is first set to 400). | |
8784 | |
8785 @code{specbind()} binds a symbol to a value and is used for local | |
8786 variables and @code{let} forms. The symbol and its old value (which | |
8787 might be @code{Qunbound}, indicating no prior value) are recorded in the | |
8788 specpdl array, and @code{specpdl_ptr} is advanced by 1. | |
8789 | |
8790 @code{record_unwind_protect()} implements an @dfn{unwind-protect}, | |
8791 which, when placed around a section of code, ensures that some specified | |
8792 cleanup routine will be executed even if the code exits abnormally | |
8793 (e.g. through a @code{throw} or quit). @code{record_unwind_protect()} | |
8794 simply adds a new specbinding to the @code{specpdl} array and stores the | |
8795 appropriate information in it. The cleanup routine can either be a C | |
8796 function, which is stored in the @code{func} field, or a @code{progn} | |
8797 form, which is stored in the @code{old_value} field. | |
8798 | |
8799 @code{unbind_to()} removes specbindings from the @code{specpdl} array | |
8800 until the specified position is reached. Each specbinding can be one of | |
8801 three types: | |
8802 | |
8803 @enumerate | |
8804 @item | |
8805 an unwind-protect with a C cleanup function (@code{func} is not 0, and | |
8806 @code{old_value} holds an argument to be passed to the function); | |
8807 @item | |
8808 an unwind-protect with a Lisp form (@code{func} is 0, @code{symbol} | |
8809 is @code{nil}, and @code{old_value} holds the form to be executed with | |
8810 @code{Fprogn()}); or | |
8811 @item | |
8812 a local-variable binding (@code{func} is 0, @code{symbol} is not | |
8813 @code{nil}, and @code{old_value} holds the old value, which is stored as | |
8814 the symbol's value). | |
8815 @end enumerate | |
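The following is a minimal sketch of the stack discipline described above, not the actual XEmacs code. It assumes a toy world in which a symbol is a struct holding an @code{int} value, the stack has a fixed size (there is no equivalent of @code{grow_specpdl()}), and only two of the three entry types exist: the Lisp-form unwind-protect is omitted because the sketch has no interpreter. All @code{toy_} names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a Lisp symbol: just a value slot. */
typedef struct toy_symbol { int value; } toy_symbol;

/* Toy stand-in for struct specbinding. */
typedef struct toy_specbinding {
  toy_symbol *symbol;   /* NULL for an unwind-protect entry */
  int old_value;        /* saved value, or the argument for func */
  void (*func) (int);   /* non-NULL for a C cleanup function */
} toy_specbinding;

enum { TOY_SPECPDL_SIZE = 64 };
static toy_specbinding toy_specpdl[TOY_SPECPDL_SIZE];
static int toy_specpdl_depth = 0;   /* index of the first free slot */

/* Like specbind(): record the old value, then install the new one. */
static void toy_specbind (toy_symbol *sym, int new_value)
{
  assert (toy_specpdl_depth < TOY_SPECPDL_SIZE);
  toy_specbinding *b = &toy_specpdl[toy_specpdl_depth++];
  b->symbol = sym;
  b->old_value = sym->value;
  b->func = NULL;
  sym->value = new_value;
}

/* Like record_unwind_protect() with a C cleanup function. */
static void toy_record_unwind_protect (void (*func) (int), int arg)
{
  assert (toy_specpdl_depth < TOY_SPECPDL_SIZE);
  toy_specbinding *b = &toy_specpdl[toy_specpdl_depth++];
  b->symbol = NULL;
  b->old_value = arg;
  b->func = func;
}

/* Like unbind_to(): pop entries, newest first, down to DEPTH. */
static void toy_unbind_to (int depth)
{
  while (toy_specpdl_depth > depth)
    {
      toy_specbinding *b = &toy_specpdl[--toy_specpdl_depth];
      if (b->func)
        b->func (b->old_value);             /* run the cleanup */
      else if (b->symbol)
        b->symbol->value = b->old_value;    /* restore the variable */
    }
}

static int toy_cleanup_total = 0;
static void toy_cleanup (int arg) { toy_cleanup_total += arg; }

/* Two nested bindings of one variable with a cleanup in between. */
static int toy_demo (void)
{
  toy_symbol x = { 1 };
  int depth = toy_specpdl_depth;
  toy_specbind (&x, 2);
  toy_record_unwind_protect (toy_cleanup, 10);
  toy_specbind (&x, 3);
  toy_unbind_to (depth);
  return x.value;
}
```

Because entries are popped newest first, the inner binding is undone before the cleanup runs, and the outer binding is undone last, which is the same ordering a nested @code{let} inside an @code{unwind-protect} would need.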
8816 | |
8817 @node Simple Special Forms, Catch and Throw, Dynamic Binding; The specbinding Stack; Unwind-Protects, Evaluation; Stack Frames; Bindings | |
8818 @section Simple Special Forms | |
8819 @cindex special forms, simple | |
8820 | |
8821 @code{or}, @code{and}, @code{if}, @code{cond}, @code{progn}, | |
8822 @code{prog1}, @code{prog2}, @code{setq}, @code{quote}, @code{function}, | |
8823 @code{let*}, @code{let}, @code{while} | |
8824 | |
8825 All of these are very simple and work as expected, calling | |
8826 @code{Feval()} or @code{Fprogn()} as necessary and (in the case of | |
8827 @code{let} and @code{let*}) using @code{specbind()} to create bindings | |
8828 and @code{unbind_to()} to undo the bindings when finished. | |
8829 | |
8830 Note that, with the exception of @code{Fprogn}, these functions are | |
8831 typically called in real life only in interpreted code, since the byte | |
8832 compiler knows how to convert calls to these functions directly into | |
8833 byte code. | |
8834 | |
8835 @node Catch and Throw, , Simple Special Forms, Evaluation; Stack Frames; Bindings | |
8836 @section Catch and Throw | |
8837 @cindex catch and throw | |
8838 @cindex throw, catch and | |
8839 | |
8840 @example | |
8841 struct catchtag | |
8842 @{ | |
8843 Lisp_Object tag; | |
8844 Lisp_Object val; | |
8845 struct catchtag *next; | |
8846 struct gcpro *gcpro; | |
8847 jmp_buf jmp; | |
8848 struct backtrace *backlist; | |
8849 int lisp_eval_depth; | |
8850 int pdlcount; | |
8851 @}; | |
8852 @end example | |
8853 | |
8854 @code{catch} is a Lisp function that places a catch around a body of | |
8855 code. A catch is a means of non-local exit from the code. When a catch | |
8856 is created, a tag is specified, and executing a @code{throw} to this tag | |
8857 will exit from the body of code caught with this tag, and its value will | |
8858 be the value given in the call to @code{throw}. If there is no such | |
8859 call, the code will be executed normally. | |
8860 | |
8861 Information pertaining to a catch is held in a @code{struct catchtag}, | |
8862 which is placed at the head of a linked list pointed to by | |
8863 @code{catchlist}. @code{internal_catch()} is passed a C function to | |
8864 call (@code{Fprogn()} when Lisp @code{catch} is called) and arguments to | |
8865 give it, and places a catch around the function. Each @code{struct | |
8866 catchtag} is held in the stack frame of the @code{internal_catch()} | |
8867 instance that created the catch. | |
8868 | |
8869 @code{internal_catch()} is fairly straightforward. It stores into the | |
8870 @code{struct catchtag} the tag name and the current values of | |
8871 @code{backtrace_list}, @code{lisp_eval_depth}, @code{gcprolist}, and the | |
8872 offset into the @code{specpdl} array, sets a jump point with @code{_setjmp()} | |
8873 (storing the jump point into the @code{struct catchtag}), and calls the | |
8874 function. Control will return to @code{internal_catch()} either when | |
8875 the function exits normally or through a @code{_longjmp()} to this jump | |
8876 point. In the latter case, @code{throw} will store the value to be | |
8877 returned into the @code{struct catchtag} before jumping. When it's | |
8878 done, @code{internal_catch()} removes the @code{struct catchtag} from | |
8879 the catchlist and returns the proper value. | |
8880 | |
8881 @code{Fthrow()} goes up through the catchlist until it finds one with | |
8882 a matching tag. It then calls @code{unbind_catch()} to restore | |
8883 everything to what it was when the appropriate catch was set, stores the | |
8884 return value in the @code{struct catchtag}, and jumps (with | |
8885 @code{_longjmp()}) to its jump point. | |
8886 | |
8887 @code{unbind_catch()} removes all catches from the catchlist until it | |
8888 finds the correct one. Some of the catches might have been placed for | |
8889 error-trapping, and if so, the appropriate entries on the handlerlist | |
8890 must be removed (see ``errors''). @code{unbind_catch()} also restores | |
8891 the values of @code{gcprolist}, @code{backtrace_list}, and | |
8892 @code{lisp_eval_depth}, and calls @code{unbind_to()} to undo any specbindings | |
8893 created since the catch. | |
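The control flow described above can be sketched in a few lines of standard C. This is only an illustration under stated assumptions: tags and values are plain @code{int}s, the portable @code{setjmp()} and @code{longjmp()} are used in place of @code{_setjmp()} and @code{_longjmp()}, and none of the saved state (backtrace list, evaluation depth, gcpro list, specpdl offset) is modeled. All @code{toy_} names are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Toy stand-in for struct catchtag. */
typedef struct toy_catchtag {
  int tag;
  volatile int val;            /* written by the thrower before the jump */
  struct toy_catchtag *next;
  jmp_buf jmp;
} toy_catchtag;

static toy_catchtag *toy_catchlist = NULL;

/* Like internal_catch(): the catchtag lives in this stack frame. */
static int toy_internal_catch (int tag, int (*func) (int), int arg)
{
  toy_catchtag c;
  int result;

  c.tag = tag;
  c.val = 0;
  c.next = toy_catchlist;
  toy_catchlist = &c;            /* push onto the head of the list */

  if (setjmp (c.jmp) == 0)
    result = func (arg);         /* normal exit */
  else
    result = c.val;              /* arrived here through toy_throw() */

  toy_catchlist = c.next;        /* pop */
  return result;
}

/* Like Fthrow(): find the matching catch, unwind to it, and jump. */
static void toy_throw (int tag, int value)
{
  toy_catchtag *c;
  for (c = toy_catchlist; c != NULL; c = c->next)
    if (c->tag == tag)
      {
        toy_catchlist = c;       /* drop inner catches, as unbind_catch() would */
        c->val = value;
        longjmp (c->jmp, 1);
      }
  /* No matching catch: the toy version simply returns. */
}

static int toy_inner (int arg)
{
  if (arg > 5)
    toy_throw (42, arg * 2);
  return arg;
}

/* Places an inner catch with a different tag around toy_inner(). */
static int toy_outer (int arg)
{
  return toy_internal_catch (7, toy_inner, arg) + 1;
}
```

Calling @code{toy_internal_catch (42, toy_outer, 10)} throws past the inner catch (tag 7) straight to the outer one, so the @code{+ 1} in @code{toy_outer()} never runs. The @code{val} field is declared @code{volatile} because it is modified between the @code{setjmp()} call and the @code{longjmp()} that returns to it.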
8894 | |
8895 | |
8896 @node Symbols and Variables, Buffers, Evaluation; Stack Frames; Bindings, Top | |
8897 @chapter Symbols and Variables | |
8898 @cindex symbols and variables | |
8899 @cindex variables, symbols and | |
8900 | |
8901 @menu | |
8902 * Introduction to Symbols:: | |
8903 * Obarrays:: | |
8904 * Symbol Values:: | |
8905 @end menu | |
8906 | |
8907 @node Introduction to Symbols, Obarrays, Symbols and Variables, Symbols and Variables | |
8908 @section Introduction to Symbols | |
8909 @cindex symbols, introduction to | |
8910 | |
8911 A symbol is basically just an object with four fields: a name (a | |
8912 string), a value (some Lisp object), a function (some Lisp object), and | |
8913 a property list (usually a list of alternating keyword/value pairs). | |
8914 What makes symbols special is that there is usually only one symbol with | |
8915 a given name, and the symbol is referred to by name. This makes a | |
8916 symbol a convenient way of calling up data by name, i.e. of implementing | |
8917 variables. (The variable's value is stored in the @dfn{value slot}.) | |
8918 Similarly, functions are referenced by name, and the definition of the | |
8919 function is stored in a symbol's @dfn{function slot}. This means that | |
8920 there can be a distinct function and variable with the same name. The | |
8921 property list is used as a more general mechanism of associating | |
8922 additional values with particular names, and once again the namespace is | |
8923 independent of the function and variable namespaces. | |
8924 | |
8925 @node Obarrays, Symbol Values, Introduction to Symbols, Symbols and Variables | |
8926 @section Obarrays | |
8927 @cindex obarrays | |
8928 | |
8929 The identity of symbols with their names is accomplished through a | |
8930 structure called an obarray, which is just a poorly-implemented hash | |
8931 table mapping from strings to symbols whose name is that string. (I say | |
8932 ``poorly implemented'' because an obarray appears in Lisp as a vector | |
8933 with some hidden fields rather than as its own opaque type. This is an | |
8934 Emacs Lisp artifact that should be fixed.) | |
8935 | |
8936 Obarrays are implemented as a vector of some fixed size (which should | |
8937 be a prime for best results), where each ``bucket'' of the vector | |
8938 contains one or more symbols, threaded through a hidden @code{next} | |
8939 field in the symbol. Lookup of a symbol in an obarray, and adding a | |
8940 symbol to an obarray, is accomplished through standard hash-table | |
8941 techniques. | |
8942 | |
8943 The standard Lisp function for working with symbols and obarrays is | |
8944 @code{intern}. This looks up a symbol in an obarray given its name; if | |
8945 it's not found, a new symbol is automatically created with the specified | |
8946 name, added to the obarray, and returned. This is what happens when the | |
8947 Lisp reader encounters a symbol (or more precisely, encounters the name | |
8948 of a symbol) in some text that it is reading. There is a standard | |
8949 obarray called @code{obarray} that is used for this purpose, although | |
8950 the Lisp programmer is free to create his own obarrays and @code{intern} | |
8951 symbols in them. | |
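As a rough illustration of the bucket-and-chain layout and of what @code{intern} does, here is a small self-contained sketch. It is not the XEmacs implementation: the hash function, the bucket count, and every @code{toy_} name are assumptions made for the example, and a symbol here has only a name and the hidden @code{next} link.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { TOY_OBARRAY_SIZE = 31 };   /* a prime, as suggested above */

/* Toy symbol: a name plus the hidden chain link. */
typedef struct toy_sym {
  char *name;
  struct toy_sym *next;
} toy_sym;

typedef struct toy_obarray {
  toy_sym *buckets[TOY_OBARRAY_SIZE];
} toy_obarray;

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned toy_hash (const char *s)
{
  unsigned h = 0;
  while (*s)
    h = h * 31u + (unsigned char) *s++;
  return h % TOY_OBARRAY_SIZE;
}

/* Lookup only: return the symbol named NAME, or NULL if absent. */
static toy_sym *toy_intern_soft (toy_obarray *ob, const char *name)
{
  toy_sym *s;
  for (s = ob->buckets[toy_hash (name)]; s != NULL; s = s->next)
    if (strcmp (s->name, name) == 0)
      return s;
  return NULL;
}

/* Lookup, creating and adding the symbol if it is not yet present. */
static toy_sym *toy_intern (toy_obarray *ob, const char *name)
{
  unsigned b = toy_hash (name);
  toy_sym *s = toy_intern_soft (ob, name);
  if (s != NULL)
    return s;

  s = malloc (sizeof *s);
  assert (s != NULL);
  s->name = malloc (strlen (name) + 1);
  assert (s->name != NULL);
  strcpy (s->name, name);

  s->next = ob->buckets[b];     /* thread onto the bucket's chain */
  ob->buckets[b] = s;
  return s;
}
```

Interning the same name twice in one toy obarray returns the same pointer, which is the property that lets a symbol be identified with its name. Interning that name in a second toy obarray yields a distinct object.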
8952 | |
8953 Note that, once a symbol is in an obarray, it stays there until | |
8954 something is done about it, and the standard obarray @code{obarray} | |
8955 always stays around, so once you use any particular variable name, a | |
8956 corresponding symbol will stay around in @code{obarray} until you exit | |
8957 XEmacs. | |
8958 | |
8959 Note that @code{obarray} itself is a variable, and as such there is a | |
8960 symbol in @code{obarray} whose name is @code{"obarray"} and which | |
8961 contains @code{obarray} as its value. | |
8962 | |
8963 Note also that this call to @code{intern} occurs only when in the Lisp | |
8964 reader, not when the code is executed (at which point the symbol is | |
8965 already around, stored as such in the definition of the function). | |
8966 | |
8967 You can create your own obarray using @code{make-vector} (this is | |
8968 horrible but is an artifact) and intern symbols into that obarray. | |
8969 Doing that will result in two or more symbols with the same name. | |
8970 However, at most one of these symbols is in the standard @code{obarray}: | |
8971 You cannot have two symbols of the same name in any particular obarray. | |
8972 Note that you cannot add a symbol to an obarray in any fashion other | |
8973 than using @code{intern}: i.e. you can't take an existing symbol and put | |
8974 it in an existing obarray. Nor can you change the name of an existing | |
8975 symbol. (Since obarrays are vectors, you can violate the consistency of | |
8976 things by storing directly into the vector, but let's ignore that | |
8977 possibility.) | |
8978 | |
8979 Usually symbols are created by @code{intern}, but if you really want, | |
8980 you can explicitly create a symbol using @code{make-symbol}, giving it | |
8981 some name. The resulting symbol is not in any obarray (i.e. it is | |
8982 @dfn{uninterned}), and you can't add it to any obarray. Therefore its | |
8983 primary purpose is as a symbol to use in macros to avoid namespace | |
8984 pollution. It can also be used as a carrier of information, but cons | |
8985 cells could probably be used just as well. | |
8986 | |
8987 You can also use @code{intern-soft} to look up a symbol but not create | |
8988 a new one, and @code{unintern} to remove a symbol from an obarray. This | |
8989 returns the removed symbol. (Remember: You can't put the symbol back | |
8990 into any obarray.) Finally, @code{mapatoms} maps over all of the symbols | |
8991 in an obarray. | |
8992 | |
8993 @node Symbol Values, , Obarrays, Symbols and Variables | |
8994 @section Symbol Values | |
8995 @cindex symbol values | |
8996 @cindex values, symbol | |
8997 | |
8998 The value field of a symbol normally contains a Lisp object. However, | |
8999 a symbol can be @dfn{unbound}, meaning that it logically has no value. | |
9000 This is internally indicated by storing in the value field a special | |
9001 Lisp object called @dfn{the unbound marker}, which is held in the global | |
9002 variable @code{Qunbound}. The unbound marker is of a special Lisp object type | |
9003 called @dfn{symbol-value-magic}. It is impossible for the Lisp | |
9004 programmer to directly create or access any object of this type. | |
9005 | |
9006 @strong{You must not let any ``symbol-value-magic'' object escape to | |
9007 the Lisp level.} Printing any of these objects will cause the message | |
9008 @samp{INTERNAL EMACS BUG} to appear as part of the print representation. | |
9009 (You may see this normally when you call @code{debug_print()} from the | |
9010 debugger on a Lisp object.) If you let one of these objects escape to | |
9011 the Lisp level, you will violate a number of assumptions contained in | |
9012 the C code and cause the unbound marker to malfunction. | |
9013 | |
9014 When a symbol is created, its value field (and function field) are set | |
9015 to @code{Qunbound}. The Lisp programmer can restore these conditions | |
9016 later using @code{makunbound} or @code{fmakunbound}, and can query to | |
9017 see whether the value or function fields are @dfn{bound} (i.e. have a | |
9018 value other than @code{Qunbound}) using @code{boundp} and | |
9019 @code{fboundp}. The fields are set to a normal Lisp object using | |
9020 @code{set} (or @code{setq}) and @code{fset}. | |
9021 | |
9022 Other symbol-value-magic objects are used as special markers to | |
9023 indicate variables that have non-normal properties. This includes any | |
9024 variables that are tied into C variables (setting the variable magically | |
9025 sets some global variable in the C code, and likewise for retrieving the | |
9026 variable's value), variables that magically tie into slots in the | |
9027 current buffer, variables that are buffer-local, etc. The | |
9028 symbol-value-magic object is stored in the value cell in place of | |
9029 a normal object, and the code to retrieve a symbol's value | |
9030 (i.e. @code{symbol-value}) knows how to do special things with them. | |
9031 This means that you should not just fetch the value cell directly if you | |
9032 want a symbol's value. | |
9033 | |
9034 The exact workings of this are rather complex; they are documented | |
9035 in detail in comments in @file{buffer.c}, @file{symbols.c}, and | |
9036 @file{lisp.h}. | |
9037 | |
9038 @node Buffers, Text, Symbols and Variables, Top | |
9039 @chapter Buffers | |
9040 @cindex buffers | |
9041 | |
9042 @menu | |
9043 * Introduction to Buffers:: A buffer holds a block of text such as a file. | |
9044 * Buffer Lists:: Keeping track of all buffers. | |
9045 * Markers and Extents:: Tagging locations within a buffer. | |
9046 * The Buffer Object:: The Lisp object corresponding to a buffer. | |
9047 @end menu | |
9048 | |
9049 @node Introduction to Buffers, Buffer Lists, Buffers, Buffers | |
9050 @section Introduction to Buffers | |
9051 @cindex buffers, introduction to | |
9052 | |
9053 A buffer is logically just a Lisp object that holds some text. | |
9054 In this, it is like a string, but a buffer is optimized for | |
9055 frequent insertion and deletion, while a string is not. Furthermore: | |
9056 | |
9057 @enumerate | |
9058 @item | |
9059 Buffers are @dfn{permanent} objects, i.e. once you create them, they | |
9060 remain around, and need to be explicitly deleted before they go away. | |
9061 @item | |
9062 Each buffer has a unique name, which is a string. Buffers are | |
9063 normally referred to by name. In this respect, they are like | |
9064 symbols. | |
9065 @item | |
9066 Buffers have a default insertion position, called @dfn{point}. | |
9067 Inserting text (unless you explicitly give a position) goes at point, | |
9068 and moves point forward past the text. This is what is going on when | |
9069 you type text into Emacs. | |
9070 @item | |
9071 Buffers have lots of extra properties associated with them. | |
9072 @item | |
9073 Buffers can be @dfn{displayed}. What this means is that there | |
9074 exist a number of @dfn{windows}, which are objects that correspond | |
9075 to some visible section of your display, and each window has | |
9076 an associated buffer, and the current contents of the buffer | |
9077 are shown in that section of the display. The redisplay mechanism | |
9078 (which takes care of doing this) knows how to look at the | |
9079 text of a buffer and come up with some reasonable way of displaying | |
9080 this. Many of the properties of a buffer control how the | |
9081 buffer's text is displayed. | |
9082 @item | |
9083 One buffer is distinguished and called the @dfn{current buffer}. It is | |
9084 stored in the variable @code{current_buffer}. Buffer operations operate | |
9085 on this buffer by default. When you are typing text into a buffer, the | |
9086 buffer you are typing into is always @code{current_buffer}. Switching | |
9087 to a different window changes the current buffer. Note that Lisp code | |
9088 can temporarily change the current buffer using @code{set-buffer} (often | |
9089 enclosed in a @code{save-excursion} so that the former current buffer | |
9090 gets restored when the code is finished). However, calling | |
9091 @code{set-buffer} will NOT cause a permanent change in the current | |
9092 buffer. The reason for this is that the top-level event loop sets | |
9093 @code{current_buffer} to the buffer of the selected window, each time | |
9094 it finishes executing a user command. | |
9095 @end enumerate | |
9096 | |
9097 Make sure you understand the distinction between @dfn{current buffer} | |
9098 and @dfn{buffer of the selected window}, and the distinction between | |
9099 @dfn{point} of the current buffer and @dfn{window-point} of the selected | |
9100 window. (This latter distinction is explained in detail in the section | |
9101 on windows.) | |
9102 | |
9103 @node Buffer Lists, Markers and Extents, Introduction to Buffers, Buffers | |
9104 @section Buffer Lists | |
9105 @cindex buffer lists | |
9106 | |
9107 Recall from earlier that buffers are @dfn{permanent} objects, i.e. that | |
9108 they remain around until explicitly deleted. This entails that there is | |
9109 a list of all the buffers in existence. This list is actually an | |
9110 assoc-list (mapping from the buffer's name to the buffer) and is stored | |
9111 in the global variable @code{Vbuffer_alist}. | |
9112 | |
9113 The order of the buffers in the list is important: the buffers are | |
9114 ordered approximately from most-recently-used to least-recently-used. | |
9115 Switching to a buffer using @code{switch-to-buffer}, | |
9116 @code{pop-to-buffer}, etc. and switching windows using | |
9117 @code{other-window}, etc. usually brings the new current buffer to the | |
9118 front of the list. @code{switch-to-buffer}, @code{other-buffer}, | |
9119 etc. look at the beginning of the list to find an alternative buffer to | |
9120 suggest. You can also explicitly move a buffer to the end of the list | |
9121 using @code{bury-buffer}. | |
9122 | |
9123 In addition to the global ordering in @code{Vbuffer_alist}, each frame | |
9124 has its own ordering of the list. These lists always contain the same | |
9125 elements as in @code{Vbuffer_alist} although possibly in a different | |
9126 order. @code{buffer-list} normally returns the list for the selected | |
9127 frame. This allows you to work in separate frames without things | |
9128 interfering with each other. | |
9129 | |
9130 The standard way to look up a buffer given a name is | |
9131 @code{get-buffer}, and the standard way to create a new buffer is | |
9132 @code{get-buffer-create}, which looks up a buffer with a given name, | |
9133 creating a new one if necessary. These operations correspond exactly | |
9134 with the symbol operations @code{intern-soft} and @code{intern}, | |
9135 respectively. You can also force a new buffer to be created using | |
9136 @code{generate-new-buffer}, which takes a name and (if necessary) makes | |
9137 a unique name from this by appending a number, and then creates the | |
9138 buffer. This is basically like the symbol operation @code{gensym}. | |
9139 | |
9140 @node Markers and Extents, The Buffer Object, Buffer Lists, Buffers | |
9141 @section Markers and Extents | |
9142 @cindex markers and extents | |
9143 @cindex extents, markers and | |
9144 | |
9145 Among the objects associated with a buffer are some that are | |
9146 logically attached to certain buffer positions. These can be used to | |
9147 keep track of a buffer position when text is inserted and deleted, so | |
9148 that it remains at the same spot relative to the text around it; to | |
9149 assign properties to particular sections of text; etc. There are two | |
9150 such objects that are useful in this regard: they are @dfn{markers} and | |
9151 @dfn{extents}. | |
9152 | |
9153 A @dfn{marker} is simply a flag placed at a particular buffer | |
9154 position, which is moved around as text is inserted and deleted. | |
9155 Markers are used for all sorts of purposes, such as the @code{mark} that | |
9156 is the other end of textual regions to be cut, copied, etc. | |
9157 | |
9158 An @dfn{extent} is similar to two markers plus some associated | |
9159 properties, and is used to keep track of regions in a buffer as text is | |
9160 inserted and deleted, and to add properties (e.g. fonts) to particular | |
9161 regions of text. The external interface of extents is explained | |
9162 elsewhere. | |
9163 | |
9164 The important thing here is that markers and extents simply contain | |
9165 buffer positions in them as integers, and every time text is inserted or | |
9166 deleted, these positions must be updated. In order to minimize the | |
9167 amount of shuffling that needs to be done, the positions in markers and | |
9168 extents (there's one per marker, two per extent) are stored in Membpos's. | |
9169 This means that they only need to be moved when the text is physically | |
9170 moved in memory; since the gap structure tries to minimize this, it also | |
9171 minimizes the number of marker and extent indices that need to be | |
9172 adjusted. Look in @file{insdel.c} for the details of how this works. | |
9173 | |
9174 One other important distinction is that markers are @dfn{temporary} | |
9175 while extents are @dfn{permanent}. This means that markers disappear as | |
9176 soon as there are no more pointers to them, and correspondingly, there | |
9177 is no way to determine what markers are in a buffer if you are just | |
9178 given the buffer. Extents remain in a buffer until they are detached | |
9179 (which could happen as a result of text being deleted) or the buffer is | |
9180 deleted, and primitives do exist to enumerate the extents in a buffer. | |
9181 | |
9182 @node The Buffer Object, , Markers and Extents, Buffers | |
9183 @section The Buffer Object | |
9184 @cindex buffer object, the | |
9185 @cindex object, the buffer | |
9186 | |
9187 Buffers contain fields not directly accessible by the Lisp programmer. | |
9188 We describe them here, naming them by the names used in the C code. | |
9189 Many are accessible indirectly in Lisp programs via Lisp primitives. | |
9190 | |
9191 @table @code | |
9192 @item name | |
9193 The buffer name is a string that names the buffer. It is guaranteed to | |
9194 be unique. @xref{Buffer Names,,, lispref, XEmacs Lisp Reference | |
9195 Manual}. | |
9196 | |
9197 @item save_modified | |
9198 This field contains the time when the buffer was last saved, as an | |
9199 integer. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference | |
9200 Manual}. | |
9201 | |
9202 @item modtime | |
9203 This field contains the modification time of the visited file. It is | |
9204 set when the file is written or read. Every time the buffer is written | |
9205 to the file, this field is compared to the modification time of the | |
9206 file. @xref{Buffer Modification,,, lispref, XEmacs Lisp Reference | |
9207 Manual}. | |
9208 | |
9209 @item auto_save_modified | |
9210 This field contains the time when the buffer was last auto-saved. | |
9211 | |
9212 @item last_window_start | |
9213 This field contains the @code{window-start} position in the buffer as of | |
9214 the last time the buffer was displayed in a window. | |
9215 | |
9216 @item undo_list | |
9217 This field points to the buffer's undo list. @xref{Undo,,, lispref, | |
9218 XEmacs Lisp Reference Manual}. | |
9219 | |
9220 @item syntax_table_v | |
9221 This field contains the syntax table for the buffer. @xref{Syntax | |
9222 Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
9223 | |
9224 @item downcase_table | |
9225 This field contains the conversion table for converting text to lower | |
9226 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
9227 | |
9228 @item upcase_table | |
9229 This field contains the conversion table for converting text to upper | |
9230 case. @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
9231 | |
9232 @item case_canon_table | |
9233 This field contains the conversion table for canonicalizing text for | |
9234 case-folding search. @xref{Case Tables,,, lispref, XEmacs Lisp | |
9235 Reference Manual}. | |
9236 | |
9237 @item case_eqv_table | |
9238 This field contains the equivalence table for case-folding search. | |
9239 @xref{Case Tables,,, lispref, XEmacs Lisp Reference Manual}. | |
9240 | |
9241 @item display_table | |
9242 This field contains the buffer's display table, or @code{nil} if it | |
9243 doesn't have one. @xref{Display Tables,,, lispref, XEmacs Lisp | |
9244 Reference Manual}. | |
9245 | |
9246 @item markers | |
9247 This field contains the chain of all markers that currently point into | |
9248 the buffer. Deletion of text in the buffer, and motion of the buffer's | |
9249 gap, must check each of these markers and perhaps update it. | |
9250 @xref{Markers,,, lispref, XEmacs Lisp Reference Manual}. | |
9251 | |
9252 @item backed_up | |
9253 This field is a flag that tells whether a backup file has been made for | |
9254 the visited file of this buffer. | |
9255 | |
9256 @item mark | |
9257 This field contains the mark for the buffer. The mark is a marker, | |
9258 hence it is also included on the list @code{markers}. @xref{The Mark,,, | |
9259 lispref, XEmacs Lisp Reference Manual}. | |
9260 | |
9261 @item mark_active | |
9262 This field is non-@code{nil} if the buffer's mark is active. | |
9263 | |
9264 @item local_var_alist | |
9265 This field contains the association list describing the variables local | |
9266 in this buffer, and their values, with the exception of local variables | |
9267 that have special slots in the buffer object. (Those slots are omitted | |
9268 from this table.) @xref{Buffer-Local Variables,,, lispref, XEmacs Lisp | |
9269 Reference Manual}. | |
9270 | |
9271 @item modeline_format | |
9272 This field contains a Lisp object which controls how to display the mode | |
9273 line for this buffer. @xref{Modeline Format,,, lispref, XEmacs Lisp | |
9274 Reference Manual}. | |
9275 | |
9276 @item base_buffer | |
9277 This field holds the buffer's base buffer (if it is an indirect buffer), | |
9278 or @code{nil}. | |
9279 @end table | |
9280 | |
9281 @node Text, Multilingual Support, Buffers, Top | |
9282 @chapter Text | |
9283 @cindex text | |
9284 | |
9285 @menu | |
9286 * The Text in a Buffer:: Representation of the text in a buffer. | |
9287 * Ibytes and Ichars:: Representation of individual characters. | |
9288 * Byte-Char Position Conversion:: | |
9289 * Searching and Matching:: Higher-level algorithms. | |
9290 @end menu | |
9291 | |
9292 @node The Text in a Buffer, Ibytes and Ichars, Text, Text | |
9293 @section The Text in a Buffer | |
9294 @cindex text in a buffer, the | |
9295 @cindex buffer, the text in a | |
9296 | |
9297 The text in a buffer consists of a sequence of zero or more | |
9298 characters. A @dfn{character} is an integer that logically represents | |
9299 a letter, number, space, or other unit of text. Most of the characters | |
9300 that you will typically encounter belong to the ASCII set of characters, | |
9301 but there are also characters for various sorts of accented letters, | |
9302 special symbols, Chinese and Japanese characters (e.g. Kanji, Katakana, | |
9303 etc.), Cyrillic and Greek letters, etc. The actual number of possible | |
9304 characters is quite large. | |
9305 | |
9306 For now, we can view a character as some non-negative integer that | |
9307 has some shape that defines how it typically appears (e.g. as an | |
9308 uppercase A). (The exact way in which a character appears depends on the | |
9309 font used to display the character.) The internal type of characters in | |
9310 the C code is an @code{Ichar}; this is just an @code{int}, but using a | |
9311 symbolic type makes the code clearer. | |
9312 | |
9313 Between every character in a buffer is a @dfn{buffer position} or | |
9314 @dfn{character position}. We can speak of the character before or after | |
9315 a particular buffer position, and when you insert a character at a | |
9316 particular position, all characters after that position end up at new | |
9317 positions. When we speak of the character @dfn{at} a position, we | |
9318 really mean the character after the position. (This schizophrenia | |
9319 between a buffer position being ``between'' two characters and ``on'' a | |
9320 character is rampant in Emacs.) | |
9321 | |
9322 Buffer positions are numbered starting at 1. This means that | |
9323 position 1 is before the first character, and position 0 is not | |
9324 valid. If there are N characters in a buffer, then buffer | |
9325 position N+1 is after the last one, and position N+2 is not valid. | |
9326 | |
9327 The internal makeup of the Ichar integer varies depending on whether | |
9328 we have compiled with MULE support. If not, the Ichar integer is an | |
9329 8-bit integer with possible values from 0 - 255. 0 - 127 are the | |
9330 standard ASCII characters, while 128 - 255 are the characters from the | |
9331 ISO-8859-1 character set. If we have compiled with MULE support, an | |
9332 Ichar is a 19-bit integer, with the various bits having meanings | |
9333 according to a complex scheme that will be detailed later. The | |
9334 characters numbered 0 - 255 still have the same meanings as for the | |
9335 non-MULE case, though. | |
9336 | |
9337 Internally, the text in a buffer is represented in a fairly simple | |
9338 fashion: as a contiguous array of bytes, with a @dfn{gap} of some size | |
9339 in the middle. Although the gap is of some substantial size in bytes, | |
9340 there is no text contained within it: From the perspective of the text | |
9341 in the buffer, it does not exist. The gap logically sits at some buffer | |
9342 position, between two characters (or possibly at the beginning or end of | |
9343 the buffer). Insertion of text in a buffer at a particular position is | |
9344 always accomplished by first moving the gap to that position | |
9345 (i.e. through some block moving of text), then writing the text into the | |
9346 beginning of the gap, thereby shrinking the gap. If the gap shrinks | |
9347 down to nothing, a new gap is created. (What actually happens is that a | |
9348 new gap is ``created'' at the end of the buffer's text, which requires | |
9349 nothing more than changing a couple of indices; then the gap is | |
9350 ``moved'' to the position where the insertion needs to take place by | |
9351 moving up in memory all the text after that position.) Similarly, | |
9352 deletion occurs by moving the gap to the place where the text is to be | |
9353 deleted, and then simply expanding the gap to include the deleted text. | |
9354 (@dfn{Expanding} and @dfn{shrinking} the gap as just described means | |
9355 just that the internal indices that keep track of where the gap is | |
9356 located are changed.) | |
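
The gap mechanism described above can be sketched in C as follows. This is an illustrative model under stated assumptions, not the code in @file{insdel.c}: @code{struct gapbuf} and the functions @code{move_gap}, @code{make_gap}, @code{gb_insert}, @code{gb_delete} and @code{gb_byte} are hypothetical names, offsets are 0-based for brevity (real buffer positions start at 1), and the amount by which the gap grows is arbitrary.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical gap buffer: text before the gap, the gap, text after it. */
struct gapbuf {
    unsigned char *data;
    size_t gap_start;   /* 0-based offset of the first byte of the gap */
    size_t gap_size;    /* bytes in the gap (they hold no text) */
    size_t text_len;    /* bytes of text, not counting the gap */
};

/* Move the gap so that it begins at text offset POS. */
static void move_gap(struct gapbuf *b, size_t pos)
{
    if (pos < b->gap_start)
        /* Slide the text in [pos, gap_start) up to the far side of the gap. */
        memmove(b->data + pos + b->gap_size, b->data + pos,
                b->gap_start - pos);
    else if (pos > b->gap_start)
        /* Slide the text just after the gap down to the near side. */
        memmove(b->data + b->gap_start,
                b->data + b->gap_start + b->gap_size,
                pos - b->gap_start);
    b->gap_start = pos;
}

/* Make sure the gap holds at least NEED bytes.  A larger gap is
   "created" at the end of the text, which costs only a reallocation
   and an index change. */
static int make_gap(struct gapbuf *b, size_t need)
{
    if (b->gap_size >= need)
        return 0;
    move_gap(b, b->text_len);
    size_t new_gap = need + 64;             /* arbitrary slack */
    unsigned char *p = realloc(b->data, b->text_len + new_gap);
    if (!p)
        return -1;
    b->data = p;
    b->gap_size = new_gap;
    return 0;
}

/* Insert N bytes at offset POS: move the gap there, then fill its start. */
int gb_insert(struct gapbuf *b, size_t pos, const void *s, size_t n)
{
    if (pos > b->text_len || make_gap(b, n) != 0)
        return -1;
    move_gap(b, pos);
    memcpy(b->data + b->gap_start, s, n);
    b->gap_start += n;                      /* the gap shrinks from the front */
    b->gap_size -= n;
    b->text_len += n;
    return 0;
}

/* Delete N bytes starting at POS: the gap simply swallows them. */
int gb_delete(struct gapbuf *b, size_t pos, size_t n)
{
    if (pos > b->text_len || n > b->text_len - pos)
        return -1;
    move_gap(b, pos);
    b->gap_size += n;
    b->text_len -= n;
    return 0;
}

/* Fetch the text byte at offset POS, stepping over the gap. */
unsigned char gb_byte(const struct gapbuf *b, size_t pos)
{
    return b->data[pos < b->gap_start ? pos : pos + b->gap_size];
}
```

Note that @code{gb_delete} copies nothing once the gap is in place; only the two counters change, which is the point of the design.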
9357 | |
9358 Note that the total amount of memory allocated for a buffer text never | |
9359 decreases while the buffer is live. Therefore, if you load up a | |
9360 20-megabyte file and then delete all but one character, there will be a | |
9361 20-megabyte gap, which won't get any smaller (except by inserting | |
9362 characters back again). Once the buffer is killed, the memory allocated | |
9363 for the buffer text will be freed, but it will still be sitting on the | |
9364 heap, taking up virtual memory, and will not be released back to the | |
9365 operating system. (However, if you have compiled XEmacs with rel-alloc, | |
9366 the situation is different. In this case, the space @emph{will} be | |
9367 released back to the operating system. However, this tends to result in a | |
9368 noticeable speed penalty.) | |
9369 | |
9370 Astute readers may notice that the text in a buffer is represented as | |
9371 an array of @emph{bytes}, while (at least in the MULE case) an Ichar is | |
9372 a 19-bit integer, which clearly cannot fit in a byte. This means (of | |
9373 course) that the text in a buffer uses a different representation from | |
9374 an Ichar: specifically, the 19-bit Ichar becomes a series of one to | |
9375 four bytes. The conversion between these two representations is complex | |
9376 and will be described later. | |
9377 | |
9378 In the non-MULE case, everything is very simple: An Ichar | |
9379 is an 8-bit value, which fits neatly into one byte. | |
9380 | |
9381 If we are given a buffer position and want to retrieve the | |
9382 character at that position, we need to follow these steps: | |
9383 | |
9384 @enumerate | |
9385 @item | |
9386 Pretend there's no gap, and convert the buffer position into a @dfn{byte | |
9387 index} that indexes to the appropriate byte in the buffer's stream of | |
9388 textual bytes. By convention, byte indices begin at 1, just like buffer | |
9389 positions. In the non-MULE case, byte indices and buffer positions are | |
9390 identical, since one character equals one byte. | |
9391 @item | |
9392 Convert the byte index into a @dfn{memory index}, which takes the gap | |
9393 into account. The memory index is a direct index into the block of | |
9394 memory that stores the text of a buffer. This basically just involves | |
9395 checking to see if the byte index is past the gap, and if so, adding the | |
9396 size of the gap to it. By convention, memory indices begin at 1, just | |
9397 like buffer positions and byte indices, and when referring to the | |
9398 position that is @dfn{at} the gap, we always use the memory position at | |
9399 the @emph{beginning}, not at the end, of the gap. | |
9400 @item | |
9401 Fetch the appropriate bytes at the determined memory position. | |
9402 @item | |
9403 Convert these bytes into an Ichar. | |
9404 @end enumerate | |
9405 | |
9406 In the non-MULE case, steps (3) and (4) boil down to a simple one-byte | |
9407 memory access. | |
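
As a rough sketch of step (2) and of the one-byte fetch, assuming the non-MULE case where a buffer position and a byte index coincide: the function names and parameters below are hypothetical and only mirror the description; the real definitions live in the C headers. All indices are 1-based, and @code{gap_start} is the memory index of the beginning of the gap.

```c
typedef int Bytebpos;   /* byte index: 1-based, pretends there is no gap */
typedef int Membpos;    /* memory index: 1-based, counts the gap's bytes */

/* Convert a byte index to a memory index.  A position exactly at the
   gap maps to the beginning of the gap, so only positions strictly
   past it are shifted by the gap size. */
Membpos bytebpos_to_membpos(Bytebpos x, Membpos gap_start, int gap_size)
{
    return x > gap_start ? x + gap_size : x;
}

/* Fetch the byte "at" byte index X, i.e. the byte after that position.
   Here a position at the gap must look past the gap, so the test is
   ">=" rather than ">".  MEM is the 0-based block of memory. */
unsigned char byte_at(const unsigned char *mem, Bytebpos x,
                      Membpos gap_start, int gap_size)
{
    Membpos m = x >= gap_start ? x + gap_size : x;
    return mem[m - 1];
}
```

The differing comparisons in the two functions reflect the ambiguity noted earlier between a position lying ``between'' two characters and a position being ``on'' a character.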
9408 | |
9409 Note that we have defined three types of positions in a buffer: | |
9410 | |
9411 @enumerate | |
9412 @item | |
9413 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos} | |
9414 @item | |
9415 @dfn{byte indices}, typedef @code{Bytebpos} | |
9416 @item | |
9417 @dfn{memory indices}, typedef @code{Membpos} | |
9418 @end enumerate | |
9419 | |
9420 All three typedefs are just @code{int}s, but defining them this way makes | |
9421 things a lot clearer. | |
9422 | |
9423 Most code works with buffer positions. In particular, all Lisp code | |
9424 that refers to text in a buffer uses buffer positions. Lisp code does | |
9425 not know that byte indices or memory indices exist. | |
9426 | |
9427 Finally, we have a typedef for the bytes in a buffer. This is an | |
9428 @code{Ibyte}, which is an unsigned char. Referring to them as | |
9429 Ibytes underscores the fact that we are working with a string of bytes | |
9430 in the internal Emacs buffer representation rather than in one of a | |
9431 number of possible alternative representations (e.g. EUC-encoded text, | |
9432 etc.). | |
9433 | |
9434 @node Ibytes and Ichars, Byte-Char Position Conversion, The Text in a Buffer, Text | |
9435 @section Ibytes and Ichars | |
9436 @cindex Ibytes and Ichars | |
9437 @cindex Ichars, Ibytes and | |
9438 | |
9439 Not yet documented. | |
9440 | |
9441 @node Byte-Char Position Conversion, Searching and Matching, Ibytes and Ichars, Text | |
9442 @section Byte-Char Position Conversion | |
9443 @cindex byte-char position conversion | |
9444 @cindex position conversion, byte-char | |
9445 @cindex conversion, byte-char position | |
9446 | |
9447 Oct 2004: | |
9448 | |
9449 This is what I wrote when describing the previous algorithm: | |
9450 | |
9451 @quotation | |
9452 The basic algorithm we use is to keep track of a known region of | |
9453 characters in each buffer, all of which are of the same width. We keep | |
9454 track of the boundaries of the region in both Charbpos and Bytebpos | |
9455 coordinates and also keep track of the char width, which is 1 - 4 bytes. | |
9456 If the position we're translating is not in the known region, then we | |
9457 invoke a function to update the known region to surround the position in | |
9458 question. This assumes locality of reference, which is usually the | |
9459 case. | |
9460 | |
9461 Note that the function to update the known region can be simple or | |
9462 complicated depending on how much information we cache. In addition to | |
9463 the known region, we always cache the correct conversions for point, | |
9464 BEGV, and ZV, and in addition to this we cache 16 positions where the | |
9465 conversion is known. We only look in the cache or update it when we | |
9466 need to move the known region more than a certain amount (currently 50 | |
9467 chars), and then we throw away a "random" value and replace it with the | |
9468 newly calculated value. | |
9469 | |
9470 Finally, we maintain an extra flag that tracks whether the buffer is | |
9471 entirely ASCII, to speed up the conversions even more. This flag is | |
9472 actually of dubious value because in an entirely-ASCII buffer the known | |
9473 region will always span the entire buffer (in fact, we update the flag | |
9474 based on this fact), and so all we're saving is a few machine cycles. | |
9475 | |
9476 A potentially smarter method than what we do with known regions and | |
9477 cached positions would be to keep some sort of pseudo-extent layer over | |
9478 the buffer; maybe keep track of the charbpos/bytebpos correspondence at | |
9479 the beginning of each line, which would allow us to do a binary search | |
9480 over the pseudo-extents to narrow things down to the correct line, at | |
9481 which point you could use a linear movement method. This would also | |
9482 mesh well with efficiently implementing a line-numbering scheme. | |
9483 However, you have to weigh the amount of time spent updating the cache | |
9484 vs. the savings that result from it. In reality, we modify the buffer | |
9485 far less often than we access it, so a cache of this sort that provides | |
9486 guaranteed LOG (N) performance (or perhaps N * LOG (N), if we set a | |
9487 maximum on the cache size) would indeed be a win, particularly in very | |
9488 large buffers. If we ever implement this, we should probably set a | |
9489 reasonably high minimum below which we use the old method, because the | |
9490 time spent updating the fancy cache would likely become dominant when | |
9491 making buffer modifications in smaller buffers. | |
9492 | |
9493 Note also that we have to multiply or divide by the char width in order | |
9494 to convert the positions. We do some tricks to avoid ever actually | |
9495 having to do a multiply or divide, because that is typically an | |
9496 expensive operation (esp. divide). Multiplying or dividing by 1, 2, or | |
9497 4 can be implemented simply as a shift left or shift right, and we keep | |
9498 track of a shifter value (0, 1, or 2) indicating how much to shift. | |
9499 Multiplying by 3 can be implemented by doubling and then adding the | |
9500 original value. Dividing by 3, alas, cannot be implemented in any | |
9501 simple shift/subtract method, as far as I know; so we just do a table | |
9502 lookup. For simplicity, we use a table of size 128K, which indexes the | |
9503 "divide-by-3" values for the first 64K non-negative numbers. (Note that | |
9504 we can increase the size up to 384K, i.e. indexing the first 192K | |
9505 non-negative numbers, while still using shorts in the array.) This also | |
9506 means that the size of the known region can be at most 64K for | |
9507 width-three characters. | |
9508 @end quotation | |
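
The multiply and divide tricks mentioned in the quotation might look roughly like the following. This is a reconstruction from the description only, not the old implementation: the names @code{div3_table}, @code{init_div3_table}, @code{chars_to_bytes} and @code{bytes_to_chars} are hypothetical, and a @code{switch} on the width stands in for the stored shifter value.

```c
#include <stddef.h>

#define DIV3_LIMIT 65536    /* the table covers byte counts 0..65535 */

/* 64K entries; 128K of memory where a short occupies two bytes. */
static unsigned short div3_table[DIV3_LIMIT];

/* Fill the table once without dividing: the quotient goes up by one
   at every third entry. */
void init_div3_table(void)
{
    unsigned short q = 0;
    int r = 0;
    for (size_t i = 0; i < DIV3_LIMIT; i++) {
        div3_table[i] = q;
        if (++r == 3) {
            r = 0;
            q++;
        }
    }
}

/* Character count to byte count for a region of WIDTH-byte characters. */
int chars_to_bytes(int nchars, int width)
{
    switch (width) {
    case 1:  return nchars;
    case 2:  return nchars << 1;
    case 3:  return (nchars << 1) + nchars;     /* double, then add */
    default: return nchars << 2;
    }
}

/* Byte count to character count.  For width 3 the caller must keep
   NBYTES below DIV3_LIMIT, which is what bounds the region size. */
int bytes_to_chars(int nbytes, int width)
{
    switch (width) {
    case 1:  return nbytes;
    case 2:  return nbytes >> 1;
    case 3:  return div3_table[nbytes];
    default: return nbytes >> 2;
    }
}
```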
9509 | |
9510 Unfortunately, it turned out that the implementation had serious problems | |
9511 which had never been corrected. In particular, the known region had a | |
9512 large tendency to become zero-length and stay that way. | |
9513 | |
9514 So I decided to port the algorithm from FSF Emacs 21.3, in @file{marker.c}. | |
9515 | |
9516 This algorithm is fairly simple. Instead of using markers I kept the cache | |
9517 array of known positions from the previous implementation. | |
9518 | |
9519 Basically, we keep a number of positions cached: | |
9520 | |
9521 @itemize @bullet | |
9522 @item | |
9523 the actual end of the buffer | |
9524 @item | |
9525 the beginning and end of the accessible region | |
9526 @item | |
9527 the value of point | |
9528 @item | |
9529 the position of the gap | |
9530 @item | |
9531 the last value we computed | |
9532 @item | |
9533 a set of positions that are "far away" from previously computed positions | |
9534 (5000 chars currently; #### perhaps should be smaller) | |
9535 @end itemize | |
9536 | |
9537 For each position, we @code{CONSIDER()} it. This means: | |
9538 | |
9539 @itemize @bullet | |
9540 @item | |
9541 If the position is what we're looking for, return it directly. | |
9542 @item | |
9543 Starting with the beginning and end of the buffer, we successively | |
9544 compute the smallest enclosing range of known positions. If at any | |
9545 point we discover that this range has the same byte and char length | |
9546 (i.e. is entirely single-byte), then our computation is trivial. | |
9547 @item | |
9548 If at any point we get a small enough range (50 chars currently), | |
9549 stop considering further positions. | |
9550 @end itemize | |
9551 | |
9552 Otherwise, once we have an enclosing range, see which side is closer, and | |
9553 iterate until we find the desired value. As an optimization, I replaced | |
9554 the simple loop in FSF with the use of @code{bytecount_to_charcount()}, | |
9555 @code{charcount_to_bytecount()}, @code{bytecount_to_charcount_down()}, or | |
9556 @code{charcount_to_bytecount_down()}. (The latter two I added for this purpose.) | |
9557 These scan 4 or 8 bytes at a time through purely single-byte characters. | |
9558 | |
9559 If the amount we had to scan was more than our "far away" distance (5000 | |
9560 characters, see above), then cache the new position. | |
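The procedure described above can be sketched roughly as follows. This is a simplified illustration, not the code in the XEmacs source: the types @code{struct known_pos} and @code{struct pos_cache} and the function names are hypothetical, UTF-8 is assumed as a stand-in for the internal encoding, the scan goes one character at a time rather than 4 or 8 bytes at a time, and only the buffer ends and the cached positions are considered (point, the gap and the accessible region are left out).

```c
#include <stddef.h>

/* Hypothetical types and names; UTF-8 stands in for the internal encoding. */
struct known_pos { size_t charpos, bytepos; };

#define FAR_AWAY 5000     /* cache a result only after a scan longer than this */
#define CLOSE_ENOUGH 50   /* stop narrowing once the range is this small */
#define CACHE_MAX 16

struct pos_cache
{
  struct known_pos entry[CACHE_MAX];
  int count;
};

/* A byte starts a character unless it is a UTF-8 continuation byte. */
static int
starts_char (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Narrow the enclosing range [lo, hi] using one known position. */
static void
consider (struct known_pos k, size_t target,
          struct known_pos *lo, struct known_pos *hi)
{
  if (k.charpos <= target && k.charpos > lo->charpos)
    *lo = k;
  if (k.charpos >= target && k.charpos < hi->charpos)
    *hi = k;
}

/* Convert a character position to a byte position. */
static size_t
char_to_byte (const unsigned char *text, size_t nbytes, size_t nchars,
              struct pos_cache *cache, size_t target)
{
  struct known_pos lo = { 0, 0 };
  struct known_pos hi;
  struct known_pos result;
  size_t scanned;
  int i;

  hi.charpos = nchars;
  hi.bytepos = nbytes;

  for (i = 0; i < cache->count; i++)
    {
      consider (cache->entry[i], target, &lo, &hi);
      if (hi.charpos - lo.charpos <= CLOSE_ENOUGH)
        break;
    }

  /* Same char and byte length: the range is entirely single-byte. */
  if (hi.charpos - lo.charpos == hi.bytepos - lo.bytepos)
    return lo.bytepos + (target - lo.charpos);

  if (target - lo.charpos <= hi.charpos - target)
    {
      result = lo;                      /* scan forward from the low side */
      while (result.charpos < target)
        {
          result.bytepos++;
          while (result.bytepos < nbytes
                 && !starts_char (text[result.bytepos]))
            result.bytepos++;
          result.charpos++;
        }
      scanned = target - lo.charpos;
    }
  else
    {
      result = hi;                      /* scan backward from the high side */
      while (result.charpos > target)
        {
          result.bytepos--;
          while (!starts_char (text[result.bytepos]))
            result.bytepos--;
          result.charpos--;
        }
      scanned = hi.charpos - target;
    }

  if (scanned > FAR_AWAY && cache->count < CACHE_MAX)
    cache->entry[cache->count++] = result;
  return result.bytepos;
}
```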
9561 | |
9562 #### Things to do: | |
9563 | |
9564 @itemize @bullet | |
9565 @item | |
9566 Look at the most recent GNU Emacs to see whether anything has changed. | |
9567 @item | |
9568 Think about whether it makes sense to try to implement some sort of | |
9569 known region or list of "known regions", like we had before. This would | |
9570 be a region of entirely single-byte characters that we can check very | |
9571 quickly. (Previously I used a range of same-width characters of any | |
9572 size; but this adds extra complexity and slows down the scanning, and is | |
9573 probably not worth it.) As part of the scanning process in | |
9574 @code{bytecount_to_charcount()} et al, we skip over chunks of entirely | |
9575 single-byte chars, so it should be easy to remember the last one. | |
9576 Presumably what we should do is keep track of the largest known surrounding | |
9577 entirely-single-byte region for each of the cache positions as well as | |
9578 perhaps the last-cached position. We want to be careful not to get bitten | |
9579 by the previous problem of having the known region getting reset too | |
9580 often. If we implement this, we might well want to continue scanning | |
9581 some distance past the desired position (maybe 300-1000 bytes) if we are | |
9582 in a single-byte range so that we won't end up expanding the known range | |
9583 one position at a time and entering the function each time. | |
9584 @item | |
9585 Think about whether it makes sense to keep the position cache sorted. | |
9586 This would allow it to be larger and finer-grained in its positions. | |
Note that with FSF's use of markers, they were sorted, but little use
was made of this.  With an array, we can do binary searching
9589 to quickly find the smallest range. We would probably want to make use of | |
9590 the gap-array code in extents.c. | |
9591 @end itemize | |
9592 | |
9593 Note that FSF's algorithm checked @strong{ALL} markers, not just the ones cached | |
9594 by this algorithm. This includes markers created by the user as well as | |
9595 both ends of any overlays. We could do similarly, and our extents could | |
9596 keep both byte and character positions rather than just the former. (But | |
9597 this would probably be overkill. We should just use our cache instead. | |
9598 Any place an extent was set was surely already visited by the char<-->byte | |
9599 conversion routines.) | |
9600 | |
9601 @node Searching and Matching, , Byte-Char Position Conversion, Text | |
9602 @section Searching and Matching | |
9603 @cindex searching | |
9604 @cindex matching | |
9605 | |
9606 Very incomplete, limited to a brief introduction. | |
9607 | |
9608 People find the searching and matching code difficult to understand. | |
9609 And indeed, the details are hard. However, the basic structures are not | |
9610 so complex. First, there's a hard question with a simple answer. What | |
9611 about Mule? The answer here is that it turns out that Mule characters | |
9612 can be matched byte by byte, so neither the search code nor the regular | |
9613 expression code need take much notice of it at all! Of course, we add | |
9614 some special features (such as regular expressions that match only | |
9615 certain charsets), but these do not require new concepts. The main | |
9616 exception is that wild-card matches in Mule have to be careful to | |
9617 swallow whole characters. This is handled using the same basic macros | |
9618 that are used for buffer and string movements. | |
9619 | |
9620 This will also be true if a UTF-8 representation is used for the | |
9621 internal encoding. | |
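As an illustration of what ``swallowing whole characters'' means, here is a minimal sketch for the UTF-8 case mentioned above. The function names are hypothetical (the actual code uses the buffer and string movement macros), and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 character that starts
   with the given lead byte.  Assumes well-formed input. */
static size_t
utf8_char_length (unsigned char lead)
{
  if (lead < 0x80)
    return 1;
  if ((lead & 0xE0) == 0xC0)
    return 2;
  if ((lead & 0xF0) == 0xE0)
    return 3;
  return 4;
}

/* What an "any character" operator must do: advance the buffer
   pointer past one whole character, not past one byte. */
static const unsigned char *
skip_one_char (const unsigned char *b)
{
  return b + utf8_char_length (*b);
}

/* Number of characters a run of wild cards would swallow in nbytes. */
static size_t
count_chars (const unsigned char *b, size_t nbytes)
{
  const unsigned char *end = b + nbytes;
  size_t n = 0;

  while (b < end)
    {
      b = skip_one_char (b);
      n++;
    }
  return n;
}
```

Exact matches need no such care: comparing the pattern bytes against the buffer bytes one at a time gives the right answer, which is the point made above.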
9622 | |
9623 The complex algorithms for searching are for simple string searches. In | |
9624 particular, the algorithm used for fast string searching is Boyer-Moore. | |
9625 This algorithm is based on the idea that if you have a mismatch at a | |
given position, you can precompute where to restart the search.  This
means that you can often make far fewer than N character
9628 comparisons, where N is the position at which the match is found, or the | |
9629 size of the text if it contains no match. That's fast! But it's not | |
9630 easy. You must ``compile'' the search string into a jump table. See | |
9631 the source, @file{search.c}, for more information. | |
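To give a feel for the jump table, here is a minimal sketch of the Horspool simplification of Boyer-Moore, which keeps only a single skip table indexed by byte value. It is not the code in @file{search.c}; the function name @code{horspool_search} is hypothetical, and case folding and multi-byte characters are ignored.

```c
#include <limits.h>
#include <stddef.h>

/* Return the offset of the first occurrence of pat in text, or -1. */
static long
horspool_search (const char *text, size_t text_len,
                 const char *pat, size_t pat_len)
{
  size_t skip[UCHAR_MAX + 1];
  size_t i, j, pos;

  if (pat_len == 0)
    return 0;

  /* "Compile" the pattern: for each byte value, how far the pattern
     may slide when that byte is aligned with its last position. */
  for (i = 0; i <= UCHAR_MAX; i++)
    skip[i] = pat_len;
  for (i = 0; i + 1 < pat_len; i++)
    skip[(unsigned char) pat[i]] = pat_len - 1 - i;

  for (pos = 0; pos + pat_len <= text_len;
       pos += skip[(unsigned char) text[pos + pat_len - 1]])
    {
      /* Compare right to left, as Boyer-Moore does. */
      j = pat_len;
      while (j > 0 && text[pos + j - 1] == pat[j - 1])
        j--;
      if (j == 0)
        return (long) pos;
    }
  return -1;
}
```

When the byte under the last pattern position does not occur in the pattern at all, the search jumps ahead by the full pattern length, which is where the saving over N comparisons comes from.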
9632 | |
9633 Emacs changes the basic algorithms somewhat in order to handle | |
9634 case-insensitive searches without a full-blown regular expression. | |
9635 | |
9636 Regular expressions, on the other hand, have a trivial search | |
9637 implementation: try a match at each position. (Under POSIX rules, it's | |
9638 a bit more complex, because POSIX requires that you find the | |
9639 @emph{longest} match in the text. This means you keep a record of the | |
9640 best match so far, and find all the matches.) | |
9641 | |
9642 The matching code for regular expressions is quite complex. First, the | |
9643 regular expression itself is compiled. There are two basic approaches | |
9644 that could be taken. The first is to compile the expression into tables | |
9645 to drive a generic finite automaton emulator. This is the approach | |
9646 given in many textbooks (Sedgewick's @emph{Algorithms} and Aho, Sethi, | |
and Ullman's @emph{Compilers: Principles, Techniques, and Tools}, aka
9648 ``The Dragon Book'') as well as being used by the @file{lex} family of | |
9649 lexical analysis engines. | |
9650 | |
9651 Emacs uses a somewhat different technique. The expression is compiled | |
9652 into a form of bytecode, which is interpreted by a special interpreter. | |
9653 The interpreter itself basically amounts to an inline implementation of | |
9654 the finite automaton emulator. The advantage of this technique is that | |
9655 it's easier to add special features, such as control of case-sensitivity | |
9656 via a global variable. | |
9657 | |
9658 The compiler is not treated here. See the source, @file{regex.c}. The | |
9659 interpreter, although it is divided into several functions, and looks | |
fearsomely complex, is actually quite simple in concept.  After all,
basically what you're doing there is a @code{strcmp} on steroids, right?
9662 | |
@example
int
strcmp (const char *p,   /* pattern pointer */
        const char *b)   /* buffer pointer */
@{
  /* Advance while the characters agree, stopping at the final NUL. */
  while (*p != '\0' && *p == *b)
    @{
      p++;
      b++;
    @}
  /* Compare as unsigned char, as the C library does. */
  return (unsigned char) *p - (unsigned char) *b;
@}
@end example
9673 | |
9674 Really, it's no harder than that. (A bit of a white lie, OK?) | |
9675 | |
9676 How does the regexp code generalize this? | |
9677 | |
9678 @enumerate | |
9679 @item | |
9680 Depending on the pattern, @code{*b} may have a general relationship to | |
9681 @code{*p}. @emph{I.e.}, direct comparison against @code{*p} is | |
9682 generalized to include checks for set membership, and context dependent | |
9683 properties. This depends on @code{&*b}. Of course that's meaningless | |
9684 in C, so we use @code{b} directly, instead. | |
9685 | |
9686 @item | |
9687 Although to ensure the algorithm terminates, @code{b} must advance step | |
9688 by step, @code{p} can branch and jump. | |
9689 | |
9690 @item | |
9691 The information returned is much greater, including information about | |
9692 subexpressions. | |
9693 @end enumerate | |
9694 | |
9695 We'll ignore (3). (2) is mostly interesting when compiling the regular | |
9696 expression. Now we have | |
9697 | |
@example
@group
enum operator_t @{
  accept = 0,
  exact,
  any,
  range,
  group,        /* actually, these are probably */
  repeat,       /* turned into conditional code */
  /* etc */
@};
@end group

@group
enum status_t @{
  working = 0,
  matched,
  mismatch,
  end_of_buffer,
  error
@};
@end group

@group
struct pattern @{
  enum operator_t operator;
  char char_value;
  unsigned char range_table[256];   /* nonzero if the byte is in the set */
  /* etc, etc */
@};
@end group

@group
/* p is the pattern pointer, b is the buffer pointer */
enum status_t
match (struct pattern *p, char *b)
@{
  enum status_t done = working;

  while (!(done = match_1_operator (p, b)))
    @{
      struct pattern *p1 = p;
      p = next_p (p, b);
      b = next_b (p1, b);
    @}
  return done;
@}
@end group
@end example
9749 | |
9750 This format exposes the underlying finite automaton. | |
9751 | |
The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions
all have the following structure, except that the @samp{next_*}
functions decide where to jump (for @samp{p}) and whether or not to
increment (for @samp{b}), rather than checking for satisfaction of a
matching condition.
9756 | |
@example
enum status_t
match_1_operator (struct pattern *p, char *b)
@{
  /* accept consumes no input, so it may succeed at the end of the buffer */
  if (p->operator != accept && ! *b) return end_of_buffer;
  switch (p->operator)
    @{
    case accept:
      return matched;
    case exact:
      if (*b != p->char_value) return mismatch;
      break;
    case any:
      break;
    case range:
      /* range_table is computed in the regexp_compile function */
      if (! p->range_table[(unsigned char) *b]) return mismatch;
      break;
    /* etc, etc */
    @}
  return working;
@}
@end example
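For readers who want something they can compile, the fragments above can be assembled into a self-contained toy along the following lines. This is a sketch under simplifying assumptions, not the code in @file{regex.c}: the identifiers are renamed (@code{OP_EXACT} and so on) to avoid clashing with anything else, a pair of bounds @code{lo} and @code{hi} stands in for @code{range_table}, and @code{next_p} and @code{next_b} never branch, so there is no grouping, repetition, or alternation.

```c
#include <stddef.h>

enum op { OP_ACCEPT = 0, OP_EXACT, OP_ANY, OP_RANGE };
enum status { WORKING = 0, MATCHED, MISMATCH, END_OF_BUFFER };

struct pattern
{
  enum op op;
  char char_value;          /* used by OP_EXACT */
  unsigned char lo, hi;     /* used by OP_RANGE; stands in for range_table */
};

static enum status
match_1_operator (const struct pattern *p, const char *b)
{
  unsigned char c = (unsigned char) *b;

  if (p->op == OP_ACCEPT)
    return MATCHED;         /* tested first: it consumes no input */
  if (c == '\0')
    return END_OF_BUFFER;
  switch (p->op)
    {
    case OP_EXACT:
      if (*b != p->char_value)
        return MISMATCH;
      break;
    case OP_ANY:
      break;
    case OP_RANGE:
      if (c < p->lo || c > p->hi)
        return MISMATCH;
      break;
    default:
      break;
    }
  return WORKING;
}

/* In this toy the pattern pointer and the buffer pointer both simply
   step forward; the real code may jump in p and may leave b alone. */
static const struct pattern *
next_p (const struct pattern *p, const char *b)
{
  (void) b;
  return p + 1;
}

static const char *
next_b (const struct pattern *p, const char *b)
{
  (void) p;
  return b + 1;
}

static enum status
match (const struct pattern *p, const char *b)
{
  enum status done;

  while ((done = match_1_operator (p, b)) == WORKING)
    {
      const struct pattern *p1 = p;
      p = next_p (p, b);
      b = next_b (p1, b);
    }
  return done;
}
```

A pattern equivalent to @samp{a.[0-9]} is then an array of four @code{struct pattern} entries ending in @code{OP_ACCEPT}, and @code{match} succeeds on any buffer that begins with text matching it.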
9778 | |
9779 Grouping, repetition, and alternation are handled by compiling the | |
9780 subexpression and calling @code{match (p->subpattern, b)} recursively. | |
9781 | |
9782 In terms of reading the actual code, there are five optimizations | |
9783 (obfuscations, if you like) that have been done. | |
9784 | |
9785 @enumerate | |
9786 @item | |
9787 An explicit "failure stack" has been substituted for recursion. | |
9788 | |
9789 @item | |
9790 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions | |
9791 are actually inlined into the @code{match} function for efficiency. | |
9792 Then the pointer movement is interspersed with the matching operations. | |
9793 | |
9794 @item | |
9795 If the operator uses buffer context, the buffer pointer movement is | |
9796 sometimes implicit in the operations retrieving the context. | |
9797 | |
9798 @item | |
9799 Some cases are combined into short preparation for individual cases, and | |
9800 a "fall-through" into combined code for several cases. | |
9801 | |
9802 @item | |
9803 The @code{pattern} type is not an explicit @samp{struct}. Instead, the | |
9804 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the | |
9805 compiled bytecode. This leads to bizarre code in the interpreter like | |
9806 | |
9807 @example | |
9808 case range: | |
9809 p += *(p + 1); break; | |
9810 @end example | |
9811 | |
9812 in @code{next_p}, because the compiled pattern is laid out | |
9813 | |
9814 @example | |
9815 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ... | |
9816 @end example | |
9817 @end enumerate | |
9818 | |
9819 But if you keep your eye on the "switch in a loop" structure, you | |
9820 should be able to understand the parts you need. | |
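To make the last point concrete, here is a minimal sketch of a ``switch in a loop'' interpreter whose operands, including the range flags, are stored inline in the compiled pattern. The opcode names and the exact layout are assumptions made for this illustration and do not reproduce the real bytecode of @file{regex.c}; there is no failure stack because the toy has no repetition or alternation.

```c
#include <stddef.h>

enum { BC_ACCEPT, BC_EXACT, BC_ANY, BC_RANGE };

/* Hypothetical layout of the compiled pattern:
     BC_EXACT, ch
     BC_ANY
     BC_RANGE, count, flag_byte_0, ..., flag_byte_(count - 1)
     BC_ACCEPT
   Bit (c % 8) of flag byte (c / 8) is set when byte value c is in
   the set.  */

/* Return the number of buffer bytes matched, or -1 on failure. */
static long
match_bytecode (const unsigned char *p, const unsigned char *b, size_t len)
{
  const unsigned char *start = b;
  const unsigned char *end = b + len;

  for (;;)
    {
      switch (*p)
        {
        case BC_ACCEPT:
          return (long) (b - start);
        case BC_EXACT:
          if (b == end || *b != p[1])
            return -1;
          b++;
          p += 2;               /* skip opcode and its operand */
          break;
        case BC_ANY:
          if (b == end)
            return -1;
          b++;
          p += 1;
          break;
        case BC_RANGE:
          {
            unsigned count = p[1];

            if (b == end)
              return -1;
            if ((unsigned) (*b / 8) >= count
                || !(p[2 + *b / 8] & (1 << (*b % 8))))
              return -1;
            b++;
            p += 2 + count;     /* skip opcode, count, and flag bytes */
            break;
          }
        default:
          return -1;            /* unknown opcode */
        }
    }
}
```

Note how the pointer movement is interspersed with the matching operations, and how the amount by which @code{p} advances after a range depends on a count read from the pattern itself.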
9821 | |
9822 @node Multilingual Support, The Lisp Reader and Compiler, Text, Top | |
9823 @chapter Multilingual Support | |
9824 @cindex Mule character sets and encodings | |
9825 @cindex character sets and encodings, Mule | |
9826 @cindex encodings, Mule character sets and | |
9827 | |
9828 @emph{NOTE}: There is a great deal of overlapping and redundant | |
9829 information in this chapter. Ben wrote introductions to Mule issues a | |
9830 number of times, each time not realizing that he had already written | |
9831 another introduction previously. Hopefully, in time these will all be | |
9832 integrated. | |
9833 | |
9834 @emph{NOTE}: The information at the top of the source file | |
9835 @file{text.c} is more complete than the following, and there is also a | |
9836 list of all other places to look for text/I18N-related info. Also look in | |
9837 @file{text.h} for info about the DFC and Eistring API's. | |
9838 | |
9839 Recall that there are two primary ways that text is represented in | |
9840 XEmacs. The @dfn{buffer} representation sees the text as a series of | |
9841 bytes (Ibytes), with a variable number of bytes used per character. | |
9842 The @dfn{character} representation sees the text as a series of integers | |
9843 (Ichars), one per character. The character representation is a cleaner | |
9844 representation from a theoretical standpoint, and is thus used in many | |
9845 cases when lots of manipulations on a string need to be done. However, | |
9846 the buffer representation is the standard representation used in both | |
9847 Lisp strings and buffers, and because of this, it is the ``default'' | |
9848 representation that text comes in. The reason for using this | |
9849 representation is that it's compact and is compatible with ASCII. | |
9850 | |
9851 @menu | |
9852 * Introduction to Multilingual Issues #1:: | |
9853 * Introduction to Multilingual Issues #2:: | |
9854 * Introduction to Multilingual Issues #3:: | |
9855 * Introduction to Multilingual Issues #4:: | |
9856 * Character Sets:: | |
9857 * Encodings:: | |
9858 * Internal Mule Encodings:: | |
9859 * Byte/Character Types; Buffer Positions; Other Typedefs:: | |
9860 * Internal Text API's:: | |
9861 * Coding for Mule:: | |
9862 * CCL:: | |
9863 * Modules for Internationalization:: | |
9864 @end menu | |
9865 | |
9866 @node Introduction to Multilingual Issues #1, Introduction to Multilingual Issues #2, Multilingual Support, Multilingual Support | |
9867 @section Introduction to Multilingual Issues #1 | |
9868 @cindex introduction to multilingual issues #1 | |
9869 | |
9870 There is an introduction to these issues in the Lisp Reference manual. | |
9871 @xref{Internationalization Terminology,,, lispref, XEmacs Lisp Reference | |
9872 Manual}. Among other documentation that may be of interest to internals | |
9873 programmers is ISO-2022 (@pxref{ISO 2022,,, lispref, XEmacs Lisp | |
9874 Reference Manual}) and CCL (@pxref{CCL,,, lispref, XEmacs Lisp Reference | |
Manual}).
9876 | |
9877 @node Introduction to Multilingual Issues #2, Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #1, Multilingual Support | |
9878 @section Introduction to Multilingual Issues #2 | |
9879 @cindex introduction to multilingual issues #2 | |
9880 | |
9881 @subheading Introduction | |
9882 | |
9883 This document covers a number of design issues, problems and proposals | |
with regard to XEmacs MULE.  First we present some definitions and
9885 some aspects of the design that have been agreed upon. Then we present | |
9886 some issues and problems that need to be addressed, and then I include a | |
9887 proposal of mine to address some of these issues. When there are other | |
9888 proposals, for example from Olivier, these will be appended to the end | |
9889 of this document. | |
9890 | |
9891 @subheading Definitions and Design Basics | |
9892 | |
9893 First, @dfn{text} is defined to be a series of characters which together | |
9894 defines an utterance or partial utterance in some language. | |
9895 Generally, this language is a human language, but it may also be a | |
9896 computer language if the computer language uses a representation close | |
9897 enough to that of human languages for it to also make sense to call its | |
9898 representation text. Text is opposed to @dfn{binary}, which is a sequence | |
9899 of bytes, representing machine-readable but not human-readable data. | |
9900 A @dfn{byte} is merely a number within a predefined range, which nowadays is | |
9901 nearly always zero to 255. A @dfn{character} is a unit of text. What makes | |
9902 one character different from another is not always clear-cut. It is | |
9903 generally related to the appearance of the character, although perhaps | |
9904 not any possible appearance of that character, but some sort of ideal | |
9905 appearance that is assigned to a character. Whether two characters | |
9906 that look very similar are actually the same depends on various | |
factors, including political ones and whether the characters are
used to mean similar sorts of things or behave similarly in similar
9909 contexts. In any case, it is not always clearly defined whether two | |
9910 characters are actually the same or not. In practice, however, this | |
9911 is more or less agreed upon. | |
9912 | |
9913 A @dfn{character set} is just that, a set of one or more characters. | |
9914 The set is unique in that there will not be more than one instance of | |
9915 the same character in a character set, and logically is unordered, | |
9916 although an order is often imposed or suggested for the characters in | |
9917 the character set. We can also define an @dfn{order} on a character | |
9918 set, which is a way of assigning a unique number, or possibly a pair of | |
9919 numbers, or a triplet of numbers, or even a set of four or more numbers | |
to each character in the character set.  The combination of a character
set and an order on it results in an @dfn{ordered character set}.  In an
9922 ordered character set, there is an upper limit and a lower limit on the | |
9923 possible values that a character, or that any number within the set of | |
9924 numbers assigned to a character, can take. However, the lower limit | |
9925 does not have to start at zero or one, or anywhere else in particular, | |
9926 nor does the upper limit have to end anywhere particular, and there may | |
9927 be gaps within these ranges such that particular numbers or sets of | |
9928 numbers do not have a corresponding character, even though they are | |
9929 within the upper and lower limits. For example, @dfn{ASCII} defines a | |
9930 very standard ordered character set. It is normally defined to be 94 | |
9931 characters in the range 33 through 126 inclusive on both ends, with | |
9932 every possible character within this range being actually present in the | |
9933 character set. | |
9934 | |
9935 Sometimes the ASCII character set is extended to include what are called | |
9936 @dfn{non-printing characters}. Non-printing characters are characters | |
which, rather than being displayed in a more or less rectangular
block like all other characters, indicate certain functions
9939 typically related to either control of the display upon which the | |
9940 characters are being displayed, or have some effect on a communications | |
9941 channel that may be currently open and transmitting characters, or may | |
9942 change the meaning of future characters as they are being decoded, or | |
9943 some other similar function. You might say that non-printing characters | |
9944 are somewhat of a hack because they are a special exception to the | |
9945 standard concept of a character as being a printed glyph that has some | |
9946 direct correspondence in the non-computer world. | |
9947 | |
9948 With non-printing characters in mind, the 94-character ordered character | |
9949 set called ASCII is often extended into a 96-character ordered character | |
9950 set, also often called ASCII, which includes in addition to the 94 | |
9951 characters already mentioned, two non-printing characters, one called | |
9952 space and assigned the number 32, just below the bottom of the previous | |
9953 range, and another called @dfn{delete} or @dfn{rubout}, which is given | |
9954 number 127 just above the end of the previous range. Thus to reiterate, | |
9955 the result is a 96-character ordered character set, whose characters | |
9956 take the values from 32 to 127 inclusive. Sometimes ASCII is further | |
9957 extended to contain 32 more non-printing characters, which are given the | |
9958 numbers zero through 31 so that the result is a 128-character ordered | |
9959 character set with characters numbered zero through 127, and with many | |
9960 non-printing characters. Another way to look at this, and the way that | |
9961 is normally taken by XEmacs MULE, is that the characters that would be | |
in the range zero through 31 in the most extended definition of ASCII,
9963 instead form their own ordered character set, which is called | |
9964 @dfn{control zero}, and consists of 32 characters in the range zero | |
9965 through 31. A similar ordered character set called @dfn{control one} is | |
9966 also created, and it contains 32 more non-printing characters in the | |
9967 range 128 through 159. Note that none of these three ordered character | |
sets overlaps in any of the numbers assigned to their
9969 characters, so they can all be used at once. Note further that the same | |
9970 character can occur in more than one character set. This was shown | |
9971 above, for example, in two different ordered character sets we defined, | |
9972 one of which we could have called @dfn{ASCII}, and the other | |
@dfn{ASCII-extended}, to show that it had been extended by two non-printable
9974 characters. Most of the characters in these two character sets are | |
9975 shared and present in both of them. | |
9976 | |
9977 Note that there is no restriction on the size of the character set, or | |
9978 on the numbers that are assigned to characters in an ordered character | |
9979 set. It is often extremely useful to represent a sequence of characters | |
9980 as a sequence of bytes, where a byte as defined above is a number in the | |
9981 range zero to 255. An @dfn{encoding} does precisely this. It is simply | |
9982 a mapping from a sequence of characters, possibly augmented with | |
9983 information indicating the character set that each of these characters | |
9984 belongs to, to a sequence of bytes which represents that sequence of | |
9985 characters and no other -- which is to say the mapping is reversible. | |
9986 | |
9987 A @dfn{coding system} is a set of rules for encoding a sequence of | |
9988 characters augmented with character set information into a sequence of | |
9989 bytes, and later performing the reverse operation. It is frequently | |
9990 possible to group coding systems into classes or types based on common | |
9991 features. Typically, for example, a particular coding system class | |
9992 may contain a base coding system which specifies some of the rules, | |
9993 but leaves the rest unspecified. Individual members of the coding | |
9994 system class are formed by starting with the base coding system, and | |
9995 augmenting it with additional rules to produce a particular coding | |
9996 system, what you might think of as a sort of variation within a | |
9997 theme. | |
9998 | |
9999 @subheading XEmacs Specific Definitions | |
10000 | |
10001 First of all, in XEmacs, the concept of character is a little different | |
10002 from the general definition given above. For one thing, the character | |
10003 set that a character belongs to may or may not be an inherent part of | |
10004 the character itself. In other words, the same character occurring in | |
10005 two different character sets may appear in XEmacs as two different | |
10006 characters. This is generally the case now, but we are attempting to | |
10007 move in the other direction. Different proposals may have different | |
10008 ideas about exactly the extent to which this change will be carried out. | |
10009 The general trend, though, is to represent all information about a | |
10010 character other than the character itself, using text properties | |
10011 attached to the character. That way two instances of the same character | |
10012 will look the same to lisp code that merely retrieves the character, and | |
10013 does not also look at the text properties of that character. Everyone | |
10014 involved is in agreement in doing it this way with all Latin characters, | |
10015 and in fact for all characters other than Chinese, Japanese, and Korean | |
10016 ideographs. For those, there may be a difference of opinion. | |
10017 | |
10018 A second difference between the general definition of character and the | |
10019 XEmacs usage of character is that each character is assigned a unique | |
10020 number that distinguishes it from all other characters in the world, or | |
10021 at the very least, from all other characters currently existing anywhere | |
10022 inside the current XEmacs invocation. (If there is a case where the | |
10023 weaker statement applies, but not the stronger statement, it would | |
10024 possibly be with composite characters and any other such characters that | |
10025 are created on the sly.) | |
10026 | |
10027 This unique number is called the @dfn{character representation} of the | |
character, and its particular details are a matter of debate.  There is
a current standard in use, but it is undoubtedly going to change.
10030 What has definitely been agreed upon is that it will be an integer, more | |
10031 specifically a positive integer, represented with less than or equal to | |
10032 31 bits on a 32-bit architecture, and possibly up to 63 bits on a 64-bit | |
architecture, with the proviso that any characters whose
representation would fit on a 64-bit architecture, but not on a 32-bit
architecture, would be used only for composite characters, and others
that would satisfy the weak uniqueness property mentioned above, but not
the strong uniqueness property.
10038 | |
10039 At this point, it is useful to talk about the different representations | |
10040 that a sequence of characters can take. The simplest representation is | |
10041 simply as a sequence of characters, and this is called the @dfn{Lisp | |
10042 representation} of text, because it is the representation that Lisp | |
10043 programs see. Other representations include the external | |
10044 representation, which refers to any encoding of the sequence of | |
10045 characters, using the definition of encoding mentioned above. | |
10046 Typically, text in the external representation is used outside of | |
10047 XEmacs, for example in files, e-mail messages, web sites, and the like. | |
10048 Another representation for a sequence of characters is what I will call | |
10049 the @dfn{byte representation}, and it represents the way that XEmacs | |
10050 internally represents text in a buffer, or in a string. Potentially, | |
10051 the representation could be different between a buffer and a string, and | |
10052 then the terms @dfn{buffer byte representation} and @dfn{string byte | |
10053 representation} would be used, but in practice I don't think this will | |
10054 occur. It will be possible, of course, for buffers and strings, or | |
10055 particular buffers and particular strings, to contain different | |
10056 sub-representations of a single representation. For example, Olivier's | |
10057 1-2-4 proposal allows for three sub-representations of his internal byte | |
representation, allowing for 1-byte, 2-byte, and 4-byte width
10059 characters respectively. A particular string may be in one | |
10060 sub-representation, and a particular buffer in another | |
10061 sub-representation, but overall both are following the same byte | |
10062 representation. I do not use the term @dfn{internal representation} | |
10063 here, as many people have, because it is potentially ambiguous. | |
10064 | |
10065 Another representation is called the @dfn{array of characters | |
10066 representation}. This is a representation on the C-level in which the | |
10067 sequence of text is represented, not using the byte representation, but | |
10068 by using an array of characters, each represented using the character | |
10069 representation. This sort of representation is often used by redisplay | |
10070 because it is more convenient to work with than any of the other | |
10071 internal representations. | |
10072 | |
10073 The term @dfn{binary representation} may also be heard. Binary | |
10074 representation is used to represent binary data. When binary data is | |
10075 represented in the lisp representation, an equivalence is simply set up | |
10076 between bytes zero through 255, and characters zero through 255. These | |
10077 characters come from four character sets, which are from bottom to top, | |
10078 control zero, ASCII, control 1, and Latin 1. Together, they comprise | |
10079 256 characters, and are a good mapping for the 256 possible bytes in a | |
10080 binary representation. Binary representation could also be used to | |
10081 refer to an external representation of the binary data, which is a | |
10082 simple direct byte-to-byte representation. No internal representation | |
10083 should ever be referred to as a binary representation because of | |
ambiguity.

The terms character set and coding system were defined
generally, above.  In XEmacs, the equivalent concepts exist, although
character set has been shortened to charset, and in fact represents
specifically an ordered character set.  For each possible charset, and
for each possible coding system, there is an associated object in
XEmacs.  These objects will be of type charset and coding system,
respectively.  Charsets and coding systems are divided into classes, or
@dfn{types}, the normal term under XEmacs, and all possible charsets
and coding systems that may be defined must be in one of these types.  If
10093 you need to create a charset or coding system that is not one of these | |
10094 types, you will have to modify the C code to support this new type. | |
10095 Some of the existing or soon-to-be-created types are, or will be, | |
10096 generic enough so that this shouldn't be an issue. Note also that the | |
10097 byte encoding for text and the character coding of a character are | |
10098 closely related. You might say that ideally each is the simplest | |
10099 equivalent of the other given the general constraints on each | |
10100 representation. | |
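The equivalence between bytes zero through 255 and the four character sets used for binary data, described earlier in this section, can be sketched as a simple classification. The enumeration and function names are hypothetical; the boundaries follow the ranges given in the text (control zero is zero through 31, ASCII is 32 through 127, control one is 128 through 159), with Latin 1 taking the remaining values 160 through 255.

```c
/* Hypothetical names; only the numeric ranges come from the text. */
enum toy_charset
{
  TOY_CONTROL_0,   /*   0 ..  31 */
  TOY_ASCII,       /*  32 .. 127 */
  TOY_CONTROL_1,   /* 128 .. 159 */
  TOY_LATIN_1      /* 160 .. 255 */
};

/* Which of the four character sets a byte of binary data maps to. */
static enum toy_charset
charset_of_byte (unsigned char byte)
{
  if (byte < 32)
    return TOY_CONTROL_0;
  if (byte < 128)
    return TOY_ASCII;
  if (byte < 160)
    return TOY_CONTROL_1;
  return TOY_LATIN_1;
}
```

Because the four ranges do not overlap and together cover every byte value, the mapping from bytes to characters is one-to-one, which is what makes it suitable for binary data.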
10101 | |
10102 To be specific, in the current MULE representation, | |
10103 | |
10104 @enumerate | |
10105 @item | |
10106 Characters encode both the character itself and the character set | |
10107 that it comes from. These character sets are always assumed to be | |
10108 representable as an ordered character set of size 96 or of size 96 | |
10109 by 96, or the trivially-related sizes 94 and 94 by 94. The only | |
10110 allowable exceptions are the control zero and control one character | |
10111 sets, which are of size 32. Character sets which do not naturally | |
10112 have a compatible ordering such as this are shoehorned into an | |
10113 ordered character set, or possibly two ordered character sets of a | |
10114 compatible size. | |
10115 @item | |
10116 The variable width byte representation was deliberately chosen to | |
10117 allow scanning text forwards and backwards efficiently. This | |
10118 necessitated defining the possible bytes into three ranges which | |
10119 we shall call A, B, and C. Range A is used exclusively for | |
10120 single-byte characters, which is to say characters that are | |
10121 represented using only one contiguous byte. Multi-byte | |
10122 characters are always represented by using one byte from Range B, | |
10123 followed by one or more bytes from Range C. What this means is | |
10124 that bytes that begin a character are unequivocally distinguished | |
10125 from bytes that do not begin a character, and therefore there is | |
10126 never a problem scanning backwards and finding the beginning of a | |
10127 character. Note that UTF-8 adopts a scheme that is very similar | |
10128 in spirit, in that it uses separate ranges for the first byte of a | |
10129 multi-byte sequence and for the following bytes in a multi-byte | |
10130 sequence. | |
10131 @item | |
10132 Given the fact that all ordered character sets allowed were | |
10133 essentially 96 characters per dimension, it made perfect sense to | |
10134 make Range C comprise 96 bytes. With a little more tweaking, the | |
10135 currently-standard MULE byte representation was created, and was | |
10136 drafted from this. | |
10137 @item | |
10138 The MULE byte representation defined four basic representations for | |
10139 characters, which would take up from one to four bytes, | |
10140 respectively. The MULE character representation thus had the | |
10141 following constraints: | |
10142 @enumerate | |
10143 @item | |
10144 Character numbers zero through 255 should represent the | |
10145 characters that binary values zero through 255 would be | |
10146 mapped onto. (Note: this was not the case in Kenichi Handa's | |
10147 version of this representation, but I changed it.) | |
10148 @item | |
10149 The four sub-classes of representation in the MULE byte | |
10150 representation should correspond to four contiguous | |
10151 non-overlapping ranges of characters. | |
10152 @item | |
10153 The algorithmic conversion between the single character | |
10154 represented in the byte representation and in the character | |
10155 representation should be as easy as possible. | |
10156 @item | |
10157 Given the previous constraints, the character representation | |
10158 should be as compact as possible, which is to say it should | |
10159 use the least number of bits possible. | |
10160 @end enumerate | |
10161 @end enumerate | |
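
The backward-scanning property described in the list above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the text gives the size of Range C (96 bytes) but not the exact boundaries, so the ranges used here (A = 0x00 - 0x7F, B = 0x80 - 0x9F, C = 0xA0 - 0xFF) are assumed, and the names @code{starts_character}, @code{char_start} and @code{next_char} are hypothetical, not functions from the XEmacs sources.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed byte ranges (illustrative; only the 96-byte size of Range C
   is stated in the text):
     Range A: 0x00 - 0x7F  single-byte characters
     Range B: 0x80 - 0x9F  first byte of a multi-byte character
     Range C: 0xA0 - 0xFF  non-first bytes of a multi-byte character */
static int starts_character(unsigned char b)
{
    return b < 0xA0;            /* Range A or Range B */
}

/* Hypothetical helper: given an offset anywhere inside well-formed
   text, return the offset of the first byte of the character that
   contains it.  No decoding and no lookahead are needed. */
size_t char_start(const unsigned char *text, size_t pos)
{
    while (pos > 0 && !starts_character(text[pos]))
        pos--;                  /* step back over Range C bytes */
    return pos;
}

/* Hypothetical helper: offset of the character following the one that
   starts at pos, or len if there is none. */
size_t next_char(const unsigned char *text, size_t len, size_t pos)
{
    assert(pos < len && starts_character(text[pos]));
    pos++;
    while (pos < len && !starts_character(text[pos]))
        pos++;
    return pos;
}
```

Because the test on a single byte is enough to tell whether it begins a character, both directions of scanning cost one comparison per byte.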
10162 | |
10163 So you see that the entire structure of the byte and character | |
10164 representations stemmed from a very small number of basic choices, | |
10165 which were | |
10166 | |
10167 @enumerate | |
10168 @item | |
10169 the choice to encode character set information in a character | |
10170 @item | |
10171 the choice to assume that all character sets would have an order | |
10172 imposed upon them with 96 characters per one or two | |
10173 dimensions. (This is less arbitrary than it seems--it follows | |
10174 ISO-2022) | |
10175 @item | |
10176 the choice to use a variable width byte representation. | |
10177 @end enumerate | |
10178 | |
10179 What this means is that you cannot really separate the byte | |
10180 representation, the character representation, and the assumptions made | |
10181 about characters and whether they represent character sets from each | |
10182 other. All of these are closely intertwined, and for purposes of | |
10183 simplicity, they should be designed together. If you change one | |
10184 representation without changing another, you are in essence creating a | |
10185 completely new design with its own attendant problems--since your new | |
10186 design is likely to be quite complex and not very coherent with | |
10187 regards to the translation between the character and byte | |
10188 representations, you are likely to run into problems. | |
10189 | |
10190 @node Introduction to Multilingual Issues #3, Introduction to Multilingual Issues #4, Introduction to Multilingual Issues #2, Multilingual Support | |
10191 @section Introduction to Multilingual Issues #3 | |
10192 @cindex introduction to multilingual issues #3 | |
10193 | |
10194 In XEmacs, Mule is a code word for the support for input handling and | |
10195 display of multi-lingual text. This section provides an overview of how | |
10196 this support impacts the C and Lisp code in XEmacs. It is important for | |
10197 anyone who works on the C or the Lisp code, especially on the C code, to | |
10198 be aware of these issues, even if they don't work directly on code that | |
10199 implements multi-lingual features, because there are various general | |
10200 procedures that need to be followed in order to write Mule-compliant | |
10201 code. (The specifics of these procedures are documented elsewhere in | |
10202 this manual.) | |
10203 | |
10204 There are four primary aspects of Mule support: | |
10205 | |
10206 @enumerate | |
10207 @item | |
10208 Internal handling and representation of multi-lingual text. | |
10209 @item | |
10210 Conversion between the internal representation of text and the various | |
10211 external representations in which multi-lingual text is encoded, such as | |
10212 Unicode representations (including mostly fixed width encodings such as | |
10213 UCS-2/UTF-16 and UCS-4 and variable width ASCII conformant encodings, | |
10214 such as UTF-7 and UTF-8); the various ISO2022 representations, which | |
10215 typically use escape sequences to switch between different character | |
10216 sets (such as Compound Text, used under X Windows; JIS, used | |
10217 specifically for encoding Japanese; and EUC, a non-modal encoding used | |
10218 for Japanese, Korean, and certain other languages); Microsoft's | |
10219 multi-byte encodings (such as Shift-JIS); various simple encodings for | |
10220 particular 8-bit character sets (such as Latin-1 and Latin-2, and | |
10221 encodings (such as koi8 and Alternativny) for Cyrillic); and others. | |
10222 This conversion needs to happen both for text in files and text sent to | |
10223 or retrieved from system API calls. It even needs to happen for | |
10224 external binary data because the internal representation does not | |
10225 represent binary data simply as a sequence of bytes as it is represented | |
10226 externally. | |
10227 @item | |
10228 Proper display of multi-lingual characters. | |
10229 @item | |
10230 Input of multi-lingual text using the keyboard. | |
10231 @end enumerate | |
10232 | |
10233 These four aspects are for the most part independent of each other. | |
10234 | |
10235 @subheading Characters, Character Sets, and Encodings | |
10236 | |
10237 A @dfn{character} (which is, BTW, a surprisingly complex concept) is, in | |
10238 a written representation of text, the most basic written unit that has a | |
10239 meaning of its own. It's comparable to a phoneme when analyzing words | |
10240 in spoken speech (for example, the sound of @samp{t} in English, which | |
10241 in fact has different pronunciations in different words -- aspirated in | |
10242 @samp{time}, unaspirated in @samp{stop}, unreleased or even pronounced | |
10243 as a glottal stop in @samp{button}, etc. -- but logically is a single | |
10244 concept). Like a phoneme, a character is an abstract concept defined by | |
10245 its @emph{meaning}. The character @samp{lowercase f}, for example, can | |
10246 always be used to represent the first letter in the word @samp{fill}, | |
10247 regardless of whether it's drawn upright or italic, whether the | |
10248 @samp{fi} combination is drawn as a single ligature, whether there are | |
10249 serifs on the bottom of the vertical stroke, etc. (These different | |
10250 appearances of a single character are often called @dfn{graphs} or | |
10251 @dfn{glyphs}.) Our concern when representing text is on representing the | |
10252 abstract characters, and not on their exact appearance. | |
10253 | |
10254 A @dfn{character set} (or @dfn{charset}), as we define it, is a set of | |
10255 characters, each with an associated number (or set of numbers -- see | |
10256 below), called a @dfn{code point}. It's important to understand that a | |
10257 character is not defined by any number attached to it, but by its | |
10258 meaning. For example, ASCII and EBCDIC are two charsets containing | |
10259 exactly the same characters (lowercase and uppercase letters, numbers 0 | |
10260 through 9, particular punctuation marks) but with different | |
10261 numberings. The `comma' character in ASCII and EBCDIC, for instance, is | |
10262 the same character despite having a different numbering. Conversely, | |
10263 when comparing ASCII and JIS-Roman, which look the same except that the | |
10264 latter has a yen sign substituted for the backslash, we would say that | |
10265 the backslash and yen sign are @strong{not} the same characters, despite having | |
10266 the same number (92) and despite the fact that all other characters are | |
10267 present in both charsets, with the same numbering. ASCII and JIS-Roman, | |
10268 then, do @emph{not} have exactly the same characters in them (ASCII has | |
10269 a backslash character but no yen-sign character, and vice-versa for | |
10270 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII | |
10271 and JIS-Roman are closer. | |
10272 | |
10273 It's also important to distinguish between charsets and encodings. For | |
10274 a simple charset like ASCII, there is only one encoding normally used -- | |
10275 each character is represented by a single byte, with the same value as | |
10276 its code point. For more complicated charsets, however, things are not | |
10277 so obvious. Unicode version 2, for example, is a large charset with | |
10278 thousands of characters, each indexed by a 16-bit number, often | |
10279 represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One | |
10280 obvious encoding uses two bytes per character (actually two encodings, | |
10281 depending on which of the two possible byte orderings is chosen). This | |
10282 encoding is convenient for internal processing of Unicode text; however, | |
10283 it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is | |
10284 usually used for external text, for example files or e-mail. UTF-8 | |
10285 represents Unicode characters with one to three bytes (often extended to | |
10286 six bytes to handle characters with up to 31-bit indices). Unicode | |
10287 characters 00 to 7F (identical with ASCII) are directly represented with | |
10288 one byte, and other characters with two or more bytes, each in the range | |
10289 80 to FF. | |
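
As a concrete illustration of the UTF-8 scheme just described, here is a minimal sketch of an encoder for the 16-bit code points discussed above. The function name @code{utf8_encode_bmp} is hypothetical, and the sketch deliberately ignores the longer (four- to six-byte) forms and surrogate values.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point in the range 0x0000 - 0xFFFF as UTF-8.
   Writes 1 to 3 bytes into out (which must have room for 3) and
   returns the number of bytes written. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char *out)
{
    assert(cp <= 0xFFFF);               /* longer forms are out of scope */
    if (cp < 0x80) {                    /* 0xxxxxxx: identical to ASCII */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
}
```

For the Hebrew letter aleph (0x05D0) this yields the two bytes 0xD7 0x90, both in the range 80 to FF, while an ASCII character such as 0x41 is passed through unchanged.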
10290 | |
10291 In general, a single encoding may be able to represent more than one | |
10292 charset. | |
10293 | |
10294 @subheading Internal Representation of Text | |
10295 | |
10296 In an ASCII or single-European-character-set world, life is very simple. | |
10297 There are 256 characters, and each character is represented using the | |
10298 numbers 0 through 255, which fit into a single byte. With a few | |
10299 exceptions (such as case-changing operations or syntax classes like | |
10300 'whitespace'), "text" is simply an array of indices into a font. You | |
10301 can get different languages simply by choosing fonts with different | |
10302 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and | |
10303 everything will "just work" as long as anyone else receiving your text | |
10304 uses a compatible font. | |
10305 | |
10306 In the multi-lingual world, however, it is much more complicated. There | |
10307 are a great number of different characters which are organized in a | |
10308 complex fashion into various character sets. The representation to use | |
10309 is not obvious because there are issues of size versus speed to | |
10310 consider. In fact, there are in general two kinds of representations to | |
10311 work with: one that represents a single character using an integer | |
10312 (possibly a byte), and the other representing a single character as a | |
10313 sequence of bytes. The former representation is normally called fixed | |
10314 width, and the other variable width. Both representations represent | |
10315 exactly the same characters, and the conversion from one representation | |
10316 to the other is governed by a specific formula (rather than by table | |
10317 lookup) but it may not be simple. Most C code need not, and in fact | |
10318 should not, know the specifics of exactly how the representations work. | |
10319 In fact, the code must not make assumptions about the representations. | |
10320 This means in particular that it must use the proper macros for | |
10321 retrieving the character at a particular memory location, determining | |
10322 how many characters are present in a particular stretch of text, and | |
10323 incrementing a pointer to a particular character to point to the | |
10324 following character, and so on. It must not assume that one character | |
10325 is stored using one byte, or even using any particular number of bytes. | |
10326 It must not assume that the number of characters in a stretch of text | |
10327 bears any particular relation to a number of bytes in that stretch. It | |
10328 must not assume that the character at a particular memory location can | |
10329 be retrieved simply by dereferencing the memory location, even if a | |
10330 character is known to be ASCII or is being compared with an ASCII | |
10331 character, etc. Careful coding is required to be Mule clean. The | |
10332 biggest work of adding Mule support, in fact, is converting all of the | |
10333 existing code to be Mule clean. | |
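
To make the point about bytes versus characters concrete, the following sketch counts characters and advances a pointer in a variable-width buffer. It is not the XEmacs implementation: the helper names are hypothetical stand-ins for the real macros, and the rule that bytes 0xA0 - 0xFF never begin a character is an assumption made for the sake of the example.

```c
#include <stddef.h>

/* Assumption for this sketch: bytes 0xA0 - 0xFF never begin a
   character; every other byte begins exactly one character. */
static int begins_char(unsigned char b)
{
    return b < 0xA0;
}

/* Hypothetical stand-in for a "how many characters are in this
   stretch of text" macro.  The answer is generally NOT nbytes. */
size_t char_count(const unsigned char *text, size_t nbytes)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; i++)
        if (begins_char(text[i]))
            n++;
    return n;
}

/* Hypothetical stand-in for an "advance to the following character"
   macro.  Requires p < end.  Plain p++ would be wrong whenever p
   points at a multi-byte character. */
const unsigned char *inc_char_ptr(const unsigned char *p,
                                  const unsigned char *end)
{
    do {
        p++;
    } while (p < end && !begins_char(*p));
    return p;
}
```

A five-byte buffer holding one ASCII character, one three-byte character and another ASCII character contains three characters, and advancing from the second character moves the pointer by three bytes, which is exactly what code that assumes one byte per character gets wrong.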
10334 | |
10335 Lisp code is mostly unaffected by these concerns. Text in strings and | |
10336 buffers appears simply as a sequence of characters regardless of | |
10337 whether Mule support is present. The biggest difference with older | |
10338 versions of Emacs, as well as current versions of GNU Emacs, is that | |
10339 integers and characters are no longer equivalent, but are separate | |
10340 Lisp Object types. | |
10341 | |
10342 @subheading Conversion Between Internal and External Representations | |
10343 | |
10344 All text needs to be converted to an external representation before being | |
10345 sent to a function or file, and all text retrieved from a function or | |
10346 file needs to be converted to the internal representation. This | |
10347 conversion needs to happen as close to the source or destination of the | |
10348 text as possible. No operations should ever be performed on text encoded | |
10349 in an external representation other than simple copying, because no | |
10350 assumptions can reliably be made about the format of this text. You | |
10351 cannot assume, for example, that the end of text is terminated by a null | |
10352 byte. (For example, if the text is Unicode, it will have many null bytes | |
10353 in it.) You cannot find the next "slash" character by searching through | |
10354 the bytes until you find a byte that looks like a "slash" character, | |
10355 because it might actually be the second byte of a Kanji character. | |
10356 Furthermore, all text in the internal representation must be converted, | |
10357 even if it is known to be completely ASCII, because the external | |
10358 representation may not be ASCII compatible (for example, if it is | |
10359 Unicode). | |
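
Both pitfalls are easy to reproduce. The sketch below applies ordinary C string functions to externally encoded data; the function and array names are hypothetical. The first array is the text "A/B" in little-endian UTF-16. The second is the katakana character "so" in Shift-JIS, whose second byte (0x5C) has the same value as an ASCII backslash; a backslash is used here rather than a forward slash because that is the byte value that actually collides in Shift-JIS.

```c
#include <stddef.h>
#include <string.h>

/* "A/B" in UTF-16LE: three characters, six bytes, then a two-byte
   terminator.  Every ASCII character is followed by a zero byte. */
const char utf16le_text[] = { 0x41, 0x00, 0x2F, 0x00, 0x42, 0x00,
                              0x00, 0x00 };

/* Katakana "so" in Shift-JIS: one character, two bytes (0x83 0x5C). */
const char sjis_text[] = { (char) 0x83, 0x5C, 0x00 };

/* Naive length: stops at the first zero byte. */
size_t naive_length(const char *external)
{
    return strlen(external);
}

/* Naive search: reports whether any byte has the given value, whether
   or not that byte is a character in its own right. */
int naive_contains_byte(const char *external, int byte)
{
    return strchr(external, byte) != NULL;
}
```

Here @code{naive_length (utf16le_text)} reports 1 although the text holds three characters, and @code{naive_contains_byte (sjis_text, 0x5C)} reports a backslash in text that contains none.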
10360 | |
10361 The place where C code needs to be the most careful is when calling | |
10362 external API functions. It is easy to forget that all text passed to or | |
10363 retrieved from these functions needs to be converted. This includes text | |
10364 in structures passed to or retrieved from these functions and all text | |
10365 that is passed to a callback function that is called by the system. | |
10366 | |
10367 Macros are provided to perform conversions to or from external text. | |
10368 These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT | |
10369 respectively. These macros accept input in various forms, for example, | |
10370 Lisp strings, buffers, lstreams, raw data, and can return data in | |
10371 multiple formats, including both @code{malloc()}ed and @code{alloca()}ed data. The use | |
10372 of @code{alloca()}ed data here is particularly important because, in general, | |
10373 the returned data will not be used after making the API call, and as a | |
10374 result, using @code{alloca()}ed data provides a very cheap and easy to use | |
10375 method of allocation. | |
10376 | |
10377 These macros take a coding system argument which indicates the nature of | |
10378 the external encoding. A coding system is an object that encapsulates | |
10379 the structures of a particular external encoding and the methods required | |
10380 to convert to and from this encoding. A facility exists to create coding | |
10381 system aliases, which in essence gives a single coding system two | |
10382 different names. It is effectively used in XEmacs to provide a layer of | |
10383 abstraction on top of the actual coding systems. For example, the coding | |
10384 system alias "file-name" points to whichever coding system is currently | |
10385 used for encoding and decoding file names as passed to or retrieved from | |
10386 system calls. In general, the actual encoding will differ from system to | |
10387 system, and will also depend on the locale that the user is in. The use | |
10388 of the file-name alias hides that implementation detail behind an | |
10389 abstract interface layer, which provides a unified set of coding | |
10390 systems that are consistent across all operating environments. | |
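
The idea of an alias layer can be sketched as a small lookup table. Everything in this example is hypothetical: the table contents are made up for illustration (they are not XEmacs defaults), and @code{resolve_coding_system} is not a function in the XEmacs sources. The point is only that callers name the abstract coding system and the mapping to an actual one lives in one place.

```c
#include <stddef.h>
#include <string.h>

struct alias {
    const char *name;       /* abstract name used by callers */
    const char *target;     /* what it currently points to   */
};

/* Hypothetical alias table; a user or a platform port could remap
   these without touching any caller. */
static const struct alias alias_table[] = {
    { "file-name", "native" },
    { "terminal",  "native" },
    { "native",    "iso-8859-1" },
};

/* Follow aliases until a name that is not an alias is reached.
   Returns NULL if the table contains a cycle. */
const char *resolve_coding_system(const char *name)
{
    size_t n = sizeof alias_table / sizeof alias_table[0];
    for (size_t hops = 0; hops <= n; hops++) {
        const char *next = NULL;
        for (size_t i = 0; i < n; i++) {
            if (strcmp(alias_table[i].name, name) == 0) {
                next = alias_table[i].target;
                break;
            }
        }
        if (next == NULL)
            return name;    /* not an alias: an actual coding system */
        name = next;
    }
    return NULL;            /* more hops than entries: alias cycle */
}
```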
10391 | |
10392 The choice of which coding system to use in a particular conversion macro | |
10393 requires some thought. In general, you should choose a lower-level | |
10394 actual coding system when the very design of the APIs you are working | |
10395 with calls for that particular coding system. In all other cases, you | |
10396 should find the least general abstract coding system (i.e. coding system | |
10397 alias) that applies to your specific situation. Only use the most | |
10398 general coding systems, such as native, when there is simply nothing else | |
10399 that is more appropriate. By doing things this way, you allow the user | |
10400 more control over how the encoding actually works, because the user is | |
10401 free to map the abstracted coding system names onto different actual | |
10402 coding systems. | |
10403 | |
10404 Some common coding systems are: | |
10405 | |
10406 @table @code | |
10407 @item ctext | |
10408 Compound Text, which is the standard encoding under X Windows, which is | |
10409 used for clipboard data and possibly other data. (ctext is a coding | |
10410 system of type ISO2022.) | |
10411 | |
10412 @item mswindows-unicode | |
10413 this is used for representing text passed to MS Windows API calls with | |
10414 arguments that need to be in Unicode format. (mswindows-unicode is a | |
10415 coding system of type UTF-16) | |
10416 | |
10417 @item mswindows-multi-byte | |
10418 this is used for representing text passed to MS Windows API calls with | |
10419 arguments that need to be in multi-byte format. Note that there are | |
10420 very few if any examples of such calls. | |
10421 | |
10422 @item mswindows-tstr | |
10423 this is used for representing text passed to any MS Windows API calls | |
10424 that declare their argument as LPTSTR, or LPCTSTR. This is the vast | |
10425 majority of system calls and automatically translates either to | |
10426 mswindows-unicode or mswindows-multi-byte, depending on the presence or | |
10427 absence of the UNICODE preprocessor constant. (If we compile XEmacs | |
10428 with this preprocessor constant, then all API calls use Unicode for all | |
10429 text passed to or received from these API calls.) | |
10430 | |
10431 @item terminal | |
10432 used for text sent to or read from a text terminal in the absence of a | |
10433 more specific coding system (calls to window-system specific APIs should | |
10434 use the appropriate window-specific coding system if it makes sense to | |
10435 do so.) | |
10436 | |
10437 @item file-name | |
10438 used when specifying the names of files in the absence of a more | |
10439 specific encoding, such as mswindows-tstr. | |
10440 | |
10441 @item native | |
10442 the most general coding system for specifying text passed to system | |
10443 calls. This generally translates to whatever coding system is specified | |
10444 by the current locale. This should only be used when none of the coding | |
10445 systems mentioned above are appropriate. | |
10446 @end table | |
10447 | |
10448 @subheading Proper Display of Multilingual Text | |
10449 | |
10450 There are two things required to get this working correctly. One is | |
10451 selecting the correct font, and the other is encoding the text according | |
10452 to the encoding used for that specific font, or the window-system | |
10453 specific text display API. Generally each separate character set has a | |
10454 different font associated with it, which is specified by name and each | |
10455 font has an associated encoding into which the characters must be | |
10456 translated. (This is the case on X Windows, at least; on Windows there | |
10457 is a more general mechanism.) Both the specific font for a charset and | |
10458 the encoding of that font are system dependent. Currently there is a | |
10459 way of specifying these two properties under X Windows (using the | |
10460 registry and ccl properties of a character set) but not for other window | |
10461 systems. A more general system needs to be implemented to allow these | |
10462 characteristics to be specified for all window systems. | |
10463 | |
10464 Another issue is making sure that the necessary fonts for displaying | |
10465 various character sets are installed on the system. Currently, XEmacs | |
10466 provides, on its web site, X Windows fonts for a number of different | |
10467 character sets that can be installed by users. This isn't done yet for | |
10468 Windows, but it should be. | |
10469 | |
10470 @subheading Inputting of Multilingual Text | |
10471 | |
10472 This is a rather complicated issue because there are many paradigms | |
10473 defined for inputting multi-lingual text, some of which are specific to | |
10474 particular languages, and any particular language may have many | |
10475 different paradigms defined for inputting its text. These paradigms are | |
10476 encoded in input methods and there is a standard API for defining an | |
10477 input method in XEmacs called LEIM, or Library of Emacs Input Methods. | |
10478 Some of these input methods are written entirely in Elisp, and thus are | |
10479 system-independent, while others require the aid either of an external | |
10480 process, or of C level support that ties into a particular | |
10481 system-specific input method API, for example, XIM under X Windows, or | |
10482 the active keyboard layout and IME support under Windows. Currently, | |
10483 there is no support for any system-specific input methods under | |
10484 Microsoft Windows, although this will change. | |
10485 | |
10486 @node Introduction to Multilingual Issues #4, Character Sets, Introduction to Multilingual Issues #3, Multilingual Support | |
10487 @section Introduction to Multilingual Issues #4 | |
10488 @cindex introduction to multilingual issues #4 | |
10489 | |
10490 The rest of the sections in this chapter consist of yet another | |
10491 introduction to multilingual issues, duplicating the information in the | |
10492 previous sections. | |
10493 | |
10494 @node Character Sets, Encodings, Introduction to Multilingual Issues #4, Multilingual Support | |
10495 @section Character Sets | |
10496 @cindex character sets | |
10497 | |
10498 A @dfn{character set} (or @dfn{charset}) is an ordered set of | |
10499 characters. A particular character in a charset is indexed using one or | |
10500 more @dfn{position codes}, which are non-negative integers. The number | |
10501 of position codes needed to identify a particular character in a charset | |
10502 is called the @dfn{dimension} of the charset. In XEmacs/Mule, all | |
10503 charsets have dimension 1 or 2, and the size of all charsets (except for | |
10504 a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range | |
10505 of position codes used to index characters from any of these types of | |
10506 character sets is as follows: | |
10507 | |
10508 @example | |
10509 Charset type            Position code 1         Position code 2 | |
10510 ------------------------------------------------------------ | |
10511 94                      33 - 126                N/A | |
10512 96                      32 - 127                N/A | |
10513 94x94                   33 - 126                33 - 126 | |
10514 96x96                   32 - 127                32 - 127 | |
10515 @end example | |
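
The table above translates directly into a validity check. The following is a minimal sketch; the enum and function names are hypothetical and chosen only for this example.

```c
/* Hypothetical names for the four regular charset types. */
enum charset_type { CS_94, CS_96, CS_94x94, CS_96x96 };

/* Number of position codes needed to identify a character. */
int charset_dimension(enum charset_type t)
{
    return (t == CS_94 || t == CS_96) ? 1 : 2;
}

/* Is pc a valid position code for one dimension of this type? */
static int in_range(enum charset_type t, int pc)
{
    if (t == CS_94 || t == CS_94x94)
        return pc >= 33 && pc <= 126;
    return pc >= 32 && pc <= 127;
}

/* Check a (pc1, pc2) pair; pc2 is ignored for dimension-1 charsets. */
int valid_position_codes(enum charset_type t, int pc1, int pc2)
{
    if (!in_range(t, pc1))
        return 0;
    return charset_dimension(t) == 1 || in_range(t, pc2);
}

/* Total number of slots in a charset of this type. */
int charset_size(enum charset_type t)
{
    int per_dim = (t == CS_94 || t == CS_94x94) ? 94 : 96;
    return charset_dimension(t) == 1 ? per_dim : per_dim * per_dim;
}
```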
10516 | |
10517 Note that in the above cases position codes do not start at an | |
10518 expected value such as 0 or 1. The reason for this will become clear | |
10519 later. | |
10520 | |
10521 For example, Latin-1 is a 96-character charset, and JISX0208 (the | |
10522 Japanese national character set) is a 94x94-character charset. | |
10523 | |
10524 [Note that, although the ranges above define the @emph{valid} position | |
10525 codes for a charset, some of the slots in a particular charset may in | |
10526 fact be empty. This is the case for JISX0208, for example, where (e.g.) | |
10527 all the slots whose first position code is in the range 117 - 126 are | |
10528 empty.] | |
10529 | |
10530 There are three charsets that do not follow the above rules. All of | |
10531 them have one dimension, and have ranges of position codes as follows: | |
10532 | |
10533 @example | |
10534 Charset name            Position code 1 | |
10535 ------------------------------------ | |
10536 ASCII                   0 - 127 | |
10537 Control-1               0 - 31 | |
10538 Composite               0 - some large number | |
10539 @end example | |
10540 | |
10541 (The upper bound of the position code for composite characters has not | |
10542 yet been determined, but it will probably be at least 16,383). | |
10543 | |
10544 ASCII is the union of two subsidiary character sets: Printing-ASCII | |
10545 (the printing ASCII character set, consisting of position codes 33 - | |
10546 126, like for a standard 94-character charset) and Control-ASCII (the | |
10547 non-printing characters that would appear in a binary file with codes 0 | |
10548 - 32 and 127). | |
10549 | |
10550 Control-1 contains the non-printing characters that would appear in a | |
10551 binary file with codes 128 - 159. | |
10552 | |
10553 Composite contains characters that are generated by overstriking one | |
10554 or more characters from other charsets. | |
10555 | |
10556 Note that some characters in ASCII, and all characters in Control-1, | |
10557 are @dfn{control} (non-printing) characters. These have no printed | |
10558 representation but instead control some other function of the printing | |
10559 (e.g. TAB, or 9, moves the current character position to the next tab | |
10560 stop). All other characters in all charsets are @dfn{graphic} | |
10561 (printing) characters. | |
10562 | |
10563 When a binary file is read in, the bytes in the file are assigned to | |
10564 character sets as follows: | |
10565 | |
10566 @example | |
10567 Bytes           Character set           Range | |
10568 -------------------------------------------------- | |
10569 0 - 127         ASCII                   0 - 127 | |
10570 128 - 159       Control-1               0 - 31 | |
10571 160 - 255       Latin-1                 32 - 127 | |
10572 @end example | |
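
The assignment in the table above can be written as a single function. This is a sketch with hypothetical type and function names; it shows that Control-1 and Latin-1 position codes are both obtained by subtracting 128 from the byte value.

```c
/* Hypothetical names for the three charsets used for binary data. */
enum binary_charset { BC_ASCII, BC_CONTROL_1, BC_LATIN_1 };

struct charset_char {
    enum binary_charset charset;
    int position_code;
};

/* Map one byte of a binary file to (charset, position code). */
struct charset_char classify_binary_byte(unsigned char b)
{
    struct charset_char c;
    if (b <= 127) {
        c.charset = BC_ASCII;
        c.position_code = b;            /* 0 - 127 */
    } else if (b <= 159) {
        c.charset = BC_CONTROL_1;
        c.position_code = b - 128;      /* 0 - 31 */
    } else {
        c.charset = BC_LATIN_1;
        c.position_code = b - 128;      /* 32 - 127 */
    }
    return c;
}
```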
10573 | |
10574 This is a bit ad-hoc but gets the job done. | |
10575 | |
10576 @node Encodings, Internal Mule Encodings, Character Sets, Multilingual Support | |
10577 @section Encodings | |
10578 @cindex encodings, Mule | |
10579 @cindex Mule encodings | |
10580 | |
10581 An @dfn{encoding} is a way of numerically representing characters from | |
10582 one or more character sets. If an encoding only encompasses one | |
10583 character set, then the position codes for the characters in that | |
10584 character set could be used directly. This is not possible, however, if | |
10585 more than one character set is to be used in the encoding. | |
10586 | |
10587 For example, the conversion detailed above between bytes in a binary | |
10588 file and characters is effectively an encoding that encompasses the | |
10589 three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit | |
10590 bytes. | |
10591 | |
10592 Thus, an encoding can be viewed as a way of encoding characters from a | |
10593 specified group of character sets using a stream of bytes, each of which | |
10594 contains a fixed number of bits (but not necessarily 8, as in the common | |
10595 usage of ``byte''). | |
10596 | |
10597 Here are descriptions of a couple of common | |
10598 encodings: | |
10599 | |
10600 @menu | |
10601 * Japanese EUC (Extended Unix Code):: | |
10602 * JIS7:: | |
10603 @end menu | |
10604 | |
10605 @node Japanese EUC (Extended Unix Code), JIS7, Encodings, Encodings | |
10606 @subsection Japanese EUC (Extended Unix Code) | |
10607 @cindex Japanese EUC (Extended Unix Code) | |
10608 @cindex EUC (Extended Unix Code), Japanese | |
10609 @cindex Extended Unix Code, Japanese EUC | |
10610 | |
10611 This encompasses the character sets Printing-ASCII, Katakana-JISX0201 | |
10612 (half-width katakana, the right half of JISX0201), Japanese-JISX0208, | |
10613 and Japanese-JISX0212. | |
10614 | |
10615 Note that Printing-ASCII and Katakana-JISX0201 are 94-character | |
10616 charsets, while Japanese-JISX0208 and Japanese-JISX0212 are | |
10617 94x94-character charsets. | |
10618 | |
10619 The encoding is as follows: | |
10620 | |
10621 @example | |
10622 Character set           Representation (PC=position-code) | |
10623 -------------           -------------- | |
10624 Printing-ASCII          PC1 | |
10625 Katakana-JISX0201       0x8E       | PC1 + 0x80 | |
10626 Japanese-JISX0208       PC1 + 0x80 | PC2 + 0x80 | |
10627 Japanese-JISX0212       0x8F       | PC1 + 0x80 | PC2 + 0x80 | |
10628 @end example | |
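
A minimal encoder for Japanese EUC might look as follows. The names are hypothetical. Note one assumption that goes beyond the bare arithmetic: the sketch writes Japanese-JISX0212 characters with a leading 0x8F byte, which is the standard EUC-JP convention for distinguishing them from Japanese-JISX0208 characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the four charsets covered by Japanese EUC. */
enum euc_charset {
    EUC_PRINTING_ASCII,
    EUC_KATAKANA_JISX0201,
    EUC_JAPANESE_JISX0208,
    EUC_JAPANESE_JISX0212
};

/* Encode one character given its charset and position codes (each in
   33 - 126).  pc2 is ignored for the dimension-1 charsets.  Writes up
   to 3 bytes into out and returns the number of bytes written. */
size_t euc_jp_encode(enum euc_charset cs, int pc1, int pc2,
                     unsigned char *out)
{
    assert(pc1 >= 33 && pc1 <= 126);
    switch (cs) {
    case EUC_PRINTING_ASCII:
        out[0] = (unsigned char) pc1;
        return 1;
    case EUC_KATAKANA_JISX0201:
        out[0] = 0x8E;                          /* single shift 2 */
        out[1] = (unsigned char) (pc1 + 0x80);
        return 2;
    case EUC_JAPANESE_JISX0208:
        assert(pc2 >= 33 && pc2 <= 126);
        out[0] = (unsigned char) (pc1 + 0x80);
        out[1] = (unsigned char) (pc2 + 0x80);
        return 2;
    case EUC_JAPANESE_JISX0212:
        assert(pc2 >= 33 && pc2 <= 126);
        out[0] = 0x8F;                          /* single shift 3 */
        out[1] = (unsigned char) (pc1 + 0x80);
        out[2] = (unsigned char) (pc2 + 0x80);
        return 3;
    }
    return 0;
}
```

For example, the hiragana letter "a" has position codes 0x24, 0x22 in Japanese-JISX0208 and is therefore encoded as the bytes 0xA4 0xA2.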
10629 | |
10630 Note that there are other versions of EUC for other Asian languages. | |
10631 EUC in general is characterized by | |
10632 | |
10633 @enumerate | |
10634 @item | |
10635 row-column encoding, | |
10636 @item | |
10637 big-endian (row-first) ordering, and | |
10638 @item | |
10639 ASCII compatibility in variable width forms. | |
10640 @end enumerate | |
10641 | |
10642 @node JIS7, , Japanese EUC (Extended Unix Code), Encodings | |
10643 @subsection JIS7 | |
10644 @cindex JIS7 | |
10645 | |
10646 This encompasses the character sets Printing-ASCII, | |
10647 Latin-JISX0201 (the left half of JISX0201; this character set | |
10648 is very similar to Printing-ASCII and is a 94-character charset), | |
10649 Japanese-JISX0208, and Katakana-JISX0201. It uses 7-bit bytes. | |
10650 | |
10651 Unlike EUC, this is a @dfn{modal} encoding, which means that there are | |
10652 multiple states that the encoding can be in, which affect how the bytes | |
10653 are to be interpreted. Special sequences of bytes (called @dfn{escape | |
10654 sequences}) are used to change states. | |
10655 | |
10656 The encoding is as follows: | |
10657 | |
10658 @example | |
10659 Character set Representation (PC=position-code) | |
10660 ------------- -------------- | |
10661 Printing-ASCII PC1 | |
10662 Latin-JISX0201 PC1 | |
10663 Katakana-JISX0201 PC1 | |
10664 Japanese-JISX0208 PC1 | PC2 | |
10665 | |
10666 | |
10667 Escape sequence ASCII equivalent Meaning | |
10668 --------------- ---------------- ------- | |
10669 0x1B 0x28 0x4A ESC ( J invoke Latin-JISX0201 | |
10670 0x1B 0x28 0x49 ESC ( I invoke Katakana-JISX0201 | |
10671 0x1B 0x24 0x42 ESC $ B invoke Japanese-JISX0208 | |
10672 0x1B 0x28 0x42 ESC ( B invoke Printing-ASCII | |
10673 @end example | |
10674 | |
10675 Initially, Printing-ASCII is invoked. | |
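To make the notion of a modal encoding concrete, here is a small sketch that counts the characters in a JIS7 byte stream by tracking the current state. It is an illustration only: the names are hypothetical, only the four escape sequences from the table are recognized, and no validation of the position codes is attempted.

```c
#include <stddef.h>

/* Hypothetical state names; one per character set in the table above. */
enum jis7_state
{
  JIS7_ASCII,
  JIS7_LATIN_JISX0201,
  JIS7_KATAKANA_JISX0201,
  JIS7_JISX0208
};

/* If p points at one of the four escape sequences, update *state and
   return 1; otherwise leave *state alone and return 0. */
static int
jis7_try_escape (const unsigned char *p, size_t len, enum jis7_state *state)
{
  if (len < 3 || p[0] != 0x1B)
    return 0;
  if (p[1] == 0x28 && p[2] == 0x4A)
    *state = JIS7_LATIN_JISX0201;       /* ESC ( J */
  else if (p[1] == 0x28 && p[2] == 0x49)
    *state = JIS7_KATAKANA_JISX0201;    /* ESC ( I */
  else if (p[1] == 0x24 && p[2] == 0x42)
    *state = JIS7_JISX0208;             /* ESC $ B */
  else if (p[1] == 0x28 && p[2] == 0x42)
    *state = JIS7_ASCII;                /* ESC ( B */
  else
    return 0;
  return 1;
}

/* Count the characters in a JIS7 stream; return -1 if the stream ends
   in the middle of a two-byte character. */
static long
jis7_count_chars (const unsigned char *p, size_t len)
{
  enum jis7_state state = JIS7_ASCII;   /* initially, ASCII is invoked */
  long count = 0;
  size_t i = 0;

  while (i < len)
    {
      size_t width;

      if (jis7_try_escape (p + i, len - i, &state))
        {
          i += 3;                       /* escape sequences are not characters */
          continue;
        }
      width = (state == JIS7_JISX0208) ? 2 : 1;
      if (len - i < width)
        return -1;
      i += width;
      count++;
    }
  return count;
}
```

Note that the meaning of a byte such as 0x30 depends entirely on the escape sequences seen before it, so the stream cannot be decoded starting from an arbitrary position.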
10676 | |
10677 @node Internal Mule Encodings, Byte/Character Types; Buffer Positions; Other Typedefs, Encodings, Multilingual Support | |
10678 @section Internal Mule Encodings | |
10679 @cindex internal Mule encodings | |
10680 @cindex Mule encodings, internal | |
10681 @cindex encodings, internal Mule | |
10682 | |
10683 In XEmacs/Mule, each character set is assigned a unique number, called a | |
10684 @dfn{leading byte}. This is used in the encodings of a character. | |
10685 Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has | |
10686 a leading byte of 0), although some leading bytes are reserved. | |
10687 | |
10688 Charsets whose leading byte is in the range 0x80 - 0x9F are called | |
10689 @dfn{official} and are used for built-in charsets. Other charsets are | |
10690 called @dfn{private} and have leading bytes in the range 0xA0 - 0xFF; | |
10691 these are user-defined charsets. | |
10692 | |
10693 More specifically: | |
10694 | |
10695 @example | |
10696 Character set Leading byte | |
10697 ------------- ------------ | |
10698 ASCII 0 (0x7F in arrays indexed by leading byte) | |
10699 Composite 0x8D | |
10700 Dimension-1 Official 0x80 - 0x8C/0x8D | |
10701 (0x8E is free) | |
10702 Control 0x8F | |
10703 Dimension-2 Official 0x90 - 0x99 | |
10704 (0x9A - 0x9D are free) | |
10705 Dimension-1 Private Marker 0x9E | |
10706 Dimension-2 Private Marker 0x9F | |
10707 Dimension-1 Private 0xA0 - 0xEF | |
10708 Dimension-2 Private 0xF0 - 0xFF | |
10709 @end example | |
10710 | |
10711 There are two internal encodings for characters in XEmacs/Mule. One is | |
10712 called @dfn{string encoding} and is an 8-bit encoding that is used for | |
10713 representing characters in a buffer or string. It uses 1 to 4 bytes per | |
10714 character. The other is called @dfn{character encoding} and is a 19-bit | |
10715 encoding that is used for representing characters individually in a | |
10716 variable. | |
10717 | |
10718 (In the following descriptions, we'll ignore composite characters for | |
10719 the moment. We also give a general (structural) overview first, | |
10720 followed later by the exact details.) | |
10721 | |
10722 @menu | |
10723 * Internal String Encoding:: | |
10724 * Internal Character Encoding:: | |
10725 @end menu | |
10726 | |
10727 @node Internal String Encoding, Internal Character Encoding, Internal Mule Encodings, Internal Mule Encodings | |
10728 @subsection Internal String Encoding | |
10729 @cindex internal string encoding | |
10730 @cindex string encoding, internal | |
10731 @cindex encoding, internal string | |
10732 | |
10733 ASCII characters are encoded using their position code directly. Other | |
10734 characters are encoded using their leading byte followed by their | |
10735 position code(s) with the high bit set. Characters in private character | |
10736 sets have their leading byte prefixed with a @dfn{leading byte prefix}, | |
10737 which is either 0x9E or 0x9F. (No character sets are ever assigned these | |
10738 leading bytes.) Specifically: | |
10739 | |
10740 @example | |
10741 Character set Encoding (PC=position-code, LB=leading-byte) | |
10742 ------------- -------- | |
10743 ASCII PC1 | |
10744 Control-1 LB | PC1 + 0xA0 | | |
10745 Dimension-1 official LB | PC1 + 0x80 | | |
10746 Dimension-1 private 0x9E | LB | PC1 + 0x80 | | |
10747 Dimension-2 official LB | PC1 + 0x80 | PC2 + 0x80 | | |
10748 Dimension-2 private 0x9F | LB | PC1 + 0x80 | PC2 + 0x80 | |
10749 @end example | |
10750 | |
10751 The basic characteristic of this encoding is that the first byte | |
10752 of every character is in the range 0x00 - 0x9F, and the second and | |
10753 following bytes of every character are in the range 0xA0 - 0xFF. | |
10754 This means that it is impossible to get out of sync, or more | |
10755 specifically: | |
10756 | |
10757 @enumerate | |
10758 @item | |
10759 Given any byte position, the beginning of the character it is | |
10760 within can be determined in constant time. | |
10761 @item | |
10762 Given any byte position at the beginning of a character, the | |
10763 beginning of the next character can be determined in constant | |
10764 time. | |
10765 @item | |
10766 Given any byte position at the beginning of a character, the | |
10767 beginning of the previous character can be determined in constant | |
10768 time. | |
10769 @item | |
10770 Textual searches can simply treat encoded strings as if they | |
10771 were encoded in a one-byte-per-character fashion rather than | |
10772 the actual multi-byte encoding. | |
10773 @end enumerate | |
10774 | |
10775 None of the standard non-modal encodings meet all of these | |
10776 conditions. For example, EUC satisfies only (2) and (3), while | |
10777 Shift-JIS and Big5 (not yet described) satisfy only (2). (All | |
10778 non-modal encodings must satisfy (2), in order to be unambiguous.) | |
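Properties (1) through (3) follow directly from the split between first bytes (0x00 - 0x9F) and following bytes (0xA0 - 0xFF). The sketch below shows one way to exploit this; the function names are hypothetical and are not the macros XEmacs itself provides. Each loop runs at most three times, since a character occupies at most four bytes.

```c
#include <stddef.h>

/* A byte in 0x00 - 0x9F begins a character; 0xA0 - 0xFF continues one. */
static int
mule_first_byte_p (unsigned char b)
{
  return b < 0xA0;
}

/* Property (1): index of the first byte of the character containing pos. */
static size_t
mule_char_start (const unsigned char *text, size_t pos)
{
  while (pos > 0 && !mule_first_byte_p (text[pos]))
    pos--;
  return pos;
}

/* Property (2): given pos at the start of a character, return the start
   of the next character (or len if there is none). */
static size_t
mule_next_char (const unsigned char *text, size_t len, size_t pos)
{
  pos++;
  while (pos < len && !mule_first_byte_p (text[pos]))
    pos++;
  return pos;
}

/* Property (3): given pos > 0 at the start of a character, return the
   start of the previous character. */
static size_t
mule_prev_char (const unsigned char *text, size_t pos)
{
  return mule_char_start (text, pos - 1);
}
```

Property (4) holds for the same reason: a match for an encoded search string can never begin in the middle of a character, because the first byte of the string cannot be mistaken for a following byte.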
10779 | |
10780 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings | |
10781 @subsection Internal Character Encoding | |
10782 @cindex internal character encoding | |
10783 @cindex character encoding, internal | |
10784 @cindex encoding, internal character | |
10785 | |
10786 One 19-bit word represents a single character. The word is | |
10787 separated into three fields: | |
10788 | |
10789 @example | |
10790 Bit number: 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 | |
10791 <------------> <------------------> <------------------> | |
10792 Field: 1 2 3 | |
10793 @end example | |
10794 | |
10795 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits. | |
10796 | |
10797 @example | |
10798 Character set Field 1 Field 2 Field 3 | |
10799 ------------- ------- ------- ------- | |
10800 ASCII 0 0 PC1 | |
10801 range: (00 - 7F) | |
10802 Control-1 0 1 PC1 | |
10803 range: (00 - 1F) | |
10804 Dimension-1 official 0 LB - 0x7F PC1 | |
10805 range: (01 - 0D) (20 - 7F) | |
10806 Dimension-1 private 0 LB - 0x80 PC1 | |
10807 range: (20 - 6F) (20 - 7F) | |
10808 Dimension-2 official LB - 0x8F PC1 PC2 | |
10809 range: (01 - 0A) (20 - 7F) (20 - 7F) | |
10810 Dimension-2 private LB - 0xE1 PC1 PC2 | |
10811 range: (0F - 1E) (20 - 7F) (20 - 7F) | |
10812 Composite 0x1F ? ? | |
10813 @end example | |
10814 | |
10815 Note that character codes 0 - 255 are the same as the ``binary | |
10816 encoding'' described above. | |
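The field layout can be expressed with a few shifts and masks. This is a sketch under the layout shown above (field 1 in bits 18 - 14, field 2 in bits 13 - 7, field 3 in bits 6 - 0); the function names are hypothetical and do not correspond to the macros in the XEmacs sources.

```c
/* Pack the three fields into a 19-bit character code. */
static unsigned long
ichar_pack (unsigned int f1, unsigned int f2, unsigned int f3)
{
  return ((unsigned long) (f1 & 0x1F) << 14)
    | ((unsigned long) (f2 & 0x7F) << 7)
    | (unsigned long) (f3 & 0x7F);
}

/* Extract field 1 (5 bits). */
static unsigned int
ichar_field1 (unsigned long c)
{
  return (unsigned int) ((c >> 14) & 0x1F);
}

/* Extract field 2 (7 bits). */
static unsigned int
ichar_field2 (unsigned long c)
{
  return (unsigned int) ((c >> 7) & 0x7F);
}

/* Extract field 3 (7 bits). */
static unsigned int
ichar_field3 (unsigned long c)
{
  return (unsigned int) (c & 0x7F);
}
```

With this layout, an ASCII character (fields 0, 0, PC1) packs to PC1 itself, and a Control-1 character (fields 0, 1, PC1) packs to 0x80 + PC1, which is consistent with the table above.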
10817 | |
10818 Most of the code in XEmacs knows nothing of the representation of a | |
10819 character other than that values 0 - 255 represent ASCII, Control 1, | |
10820 and Latin 1. | |
10821 | |
10822 @strong{WARNING WARNING WARNING}: The Boyer-Moore code in | |
10823 @file{search.c}, and the code in @code{search_buffer()} that determines | |
10824 whether that code can be used, knows that ``field 3'' in a character | |
10825 always corresponds to the last byte in the textual representation of the | |
10826 character. (This is important because the Boyer-Moore algorithm works by | |
10827 looking at the last byte of the search string and &&#### finish this. | |
10828 | |
10829 @node Byte/Character Types; Buffer Positions; Other Typedefs, Internal Text API's, Internal Mule Encodings, Multilingual Support | |
10830 @section Byte/Character Types; Buffer Positions; Other Typedefs | |
10831 @cindex byte/character types; buffer positions; other typedefs | |
10832 @cindex byte/character types | |
10833 @cindex character types | |
10834 @cindex buffer positions | |
10835 @cindex typedefs, other | |
10836 | |
10837 @menu | |
10838 * Byte Types:: | |
10839 * Different Ways of Seeing Internal Text:: | |
10840 * Buffer Positions:: | |
10841 * Other Typedefs:: | |
10842 * Usage of the Various Representations:: | |
10843 * Working With the Various Representations:: | |
10844 @end menu | |
10845 | |
10846 @node Byte Types, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs | |
10847 @subsection Byte Types | |
10848 @cindex byte types | |
10849 | |
10850 Stuff pointed to by a char * or unsigned char * will nearly always be | |
10851 one of the following types: | |
10852 | |
10853 @itemize @minus | |
10854 @item | |
10855 a) [Ibyte] pointer to internally-formatted text | |
10856 @item | |
10857 b) [Extbyte] pointer to text in some external format, which can be | |
10858 defined as all formats other than the internal one | |
10859 @item | |
10860 c) [Ascbyte] pure ASCII text | |
10861 @item | |
10862 d) [Binbyte] binary data that is not meant to be interpreted as text | |
10863 @item | |
10864 e) [Rawbyte] general data in memory, where we don't care about whether | |
10865 it's text or binary | |
10866 @item | |
10867 f) [Boolbyte] a zero or a one | |
10868 @item | |
10869 g) [Bitbyte] a byte used for bit fields | |
10870 @item | |
10871 h) [Chbyte] null-semantics @code{char *}; used when casting an argument to | |
10872 an external API where the other types may not be | |
10873 appropriate | |
10874 @end itemize | |
10875 | |
10876 Types (b), (c), (f) and (h) are defined as @code{char}, while the others are | |
10877 @code{unsigned char}. This is for maximum safety (signed characters are | |
10878 dangerous to work with) while maintaining as much compatibility with | |
10879 external API's and string constants as possible. | |
10880 | |
10881 We also provide versions of the above types defined with different | |
10882 underlying C types, for API compatibility. These use the following | |
10883 prefixes: | |
10884 | |
10885 @example | |
10886 C = plain char, when the base type is unsigned | |
10887 U = unsigned | |
10888 S = signed | |
10889 @end example | |
10890 | |
10891 (Formerly I had a comment saying that type (e) "should be replaced with | |
10892 void *". However, there are in fact many places where an unsigned char | |
10893 * might be used -- e.g. for ease in pointer computation, since void * | |
10894 doesn't allow this, and for compatibility with external API's.) | |
10895 | |
10896 Note that these typedefs are purely for documentation purposes; from | |
10897 the C code's perspective, they are exactly equivalent to @code{char *}, | |
10898 @code{unsigned char *}, etc., so you can freely use them with library | |
10899 functions declared as such. | |
10900 | |
10901 Using these more specific types rather than the general ones helps avoid | |
10902 the confusions that occur when the semantics of a char * or unsigned | |
10903 char * argument being studied are unclear. Furthermore, by requiring | |
10904 that ALL uses of @code{char} be replaced with some other type as part of the | |
10905 Mule-ization process, we can use a search for @code{char} as a way of finding | |
10906 code that has not been properly Mule-ized yet. | |
10907 | |
10908 @node Different Ways of Seeing Internal Text, Buffer Positions, Byte Types, Byte/Character Types; Buffer Positions; Other Typedefs | |
10909 @subsection Different Ways of Seeing Internal Text | |
10910 @cindex different ways of seeing internal text | |
10911 | |
10912 There are various ways of representing internal text. The two primary | |
10913 ways are as an "array" of individual characters and as a | |
10914 "stream" of bytes. In the ASCII world, where there are at most 256 | |
10915 characters, things are easy because each character fits into a | |
10916 byte. In general, however, this is not true -- see the above discussion | |
10917 of characters vs. encodings. | |
10918 | |
10919 In some cases, it's also important to distinguish between a stream | |
10920 representation as a series of bytes and as a series of textual units. | |
10921 This is particularly important with respect to Unicode. The UTF-16 representation | |
10922 (sometimes referred to, rather sloppily, as simply the "Unicode" format) | |
10923 represents text as a series of 16-bit units. Mostly, each unit | |
10924 corresponds to a single character, but not necessarily, as characters | |
10925 outside of the range 0-65535 (the BMP or "Basic Multilingual Plane" of | |
10926 Unicode) require two 16-bit units, through the mechanism of | |
10927 "surrogates". When a series of 16-bit units is serialized into a byte | |
10928 stream, there are at least two possible representations, little-endian | |
10929 and big-endian, and which one is used may depend on the native format of | |
10930 16-bit integers in the CPU of the machine that XEmacs is running | |
10931 on. (Similarly, UTF-32 is logically a representation with 32-bit textual | |
10932 units.) | |
10933 | |
10934 Specifically: | |
10935 | |
10936 @itemize @minus | |
10937 @item | |
10938 UTF-8 has 1-byte (8-bit) units. | |
10939 @item | |
10940 UTF-16 has 2-byte (16-bit) units. | |
10941 @item | |
10942 UTF-32 has 4-byte (32-bit) units. | |
10943 @item | |
10944 XEmacs-internal encoding (the old "Mule" encoding) has 1-byte (8-bit) | |
10945 units. | |
10946 @item | |
10947 UTF-7 technically has 7-bit units that are within the "mail-safe" range | |
10948 (ASCII 32 - 126 plus a few control characters), but normally is encoded | |
10949 in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a | |
10950 normal mode where printable ASCII characters represent themselves and a | |
10951 shifted mode, introduced with a plus sign, where a base-64 encoding is | |
10952 used.) | |
10953 @item | |
10954 UTF-5 technically has 7-bit units (normally encoded in an 8-bit stream, | |
10955 like UTF-7), but only uses uppercase A-V and 0-9, and only encodes 4 | |
10956 bits worth of data per character. UTF-5 is meant for encoding Unicode | |
10957 inside of DNS names. | |
10958 @end itemize | |
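The distinction between characters, textual units and bytes is easiest to see with UTF-16. The following sketch (hypothetical function names, standard UTF-16 arithmetic) first turns a code point into one or two 16-bit units and then serializes a unit into bytes in either byte order.

```c
#include <stdint.h>

/* Character -> textual units.  Store the UTF-16 units for code point cp
   in out[] and return how many were written (1 or 2), or 0 if cp is not
   a valid Unicode scalar value. */
static int
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)             /* inside the BMP: one unit */
    {
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;                /* outside the BMP: a surrogate pair */
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}

/* Textual unit -> bytes.  The same unit yields different byte streams
   depending on the byte order chosen. */
static void
utf16_unit_to_bytes (uint16_t unit, int big_endian, unsigned char out[2])
{
  unsigned char hi = (unsigned char) (unit >> 8);
  unsigned char lo = (unsigned char) (unit & 0xFF);

  out[0] = big_endian ? hi : lo;
  out[1] = big_endian ? lo : hi;
}
```

For example, U+1F600 is one character, two textual units (0xD83D, 0xDE00) and four bytes, whose order depends on the endianness used for serialization.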
10959 | |
10960 Thus, we can imagine three levels in the representation of textual data: | |
10961 | |
10962 @example | |
10963 series of characters -> series of textual units -> series of bytes | |
10964 [Ichar] [Itext] [Ibyte] | |
10965 @end example | |
10966 | |
10967 XEmacs has three corresponding typedefs: | |
10968 | |
10969 @itemize @minus | |
10970 @item | |
10971 An Ichar is an integer (at least 32-bit), representing a 31-bit | |
10972 character. | |
10973 @item | |
10974 An Itext is an unsigned value, either 8, 16 or 32 bits, depending | |
10975 on the nature of the internal representation, and corresponding to | |
10976 a single textual unit. | |
10977 @item | |
10978 An Ibyte is an @code{unsigned char}, representing a single byte in a | |
10979 textual byte stream. | |
10980 @end itemize | |
10981 | |
10982 Internal text in stream format can be simultaneously viewed as either | |
10983 @code{Itext *} or @code{Ibyte *}. The @code{Ibyte *} representation is convenient for | |
10984 copying data from one place to another, because such routines usually | |
10985 expect byte counts. However, @code{Itext *} is much better for actually | |
10986 working with the data. | |
10987 | |
10988 From a text-unit perspective, units 0 through 127 will always be ASCII | |
10989 compatible, and data in Lisp strings (and other textual data generated | |
10990 as a whole, e.g. from external conversion) will be followed by a | |
10991 null-unit terminator. From an @code{Ibyte *} perspective, however, the | |
10992 encoding is only ASCII-compatible if it uses 1-byte units. | |
10993 | |
10994 Similarly to the different text representations, three integral count | |
10995 types exist -- Charcount, Textcount and Bytecount. | |
10996 | |
10997 NOTE: Despite the presence of the terminator, internal text itself can | |
10998 have nulls in it! (Null text units, not just the null bytes present in | |
10999 any UTF-16 encoding.) The terminator is present because in many cases | |
11000 internal text is passed to routines that will ultimately pass the text | |
11001 to library functions that cannot handle embedded nulls, e.g. functions | |
11002 manipulating filenames, and it is a real hassle to have to pass the | |
11003 length around constantly. But this can lead to sloppy coding! We need | |
11004 to be careful about watching for nulls in places that are important, | |
11005 e.g. manipulating string objects or passing data to/from the clipboard. | |
11006 | |
11007 @table @code | |
11008 @item Ibyte | |
11009 The data in a buffer or string is logically made up of Ibyte objects, | |
11010 where an Ibyte takes up the same amount of space as a char. (It is | |
11011 declared differently, though, to catch invalid usages.) Strings stored | |
11012 using Ibytes are said to be in "internal format". The important | |
11013 characteristics of internal format are | |
11014 | |
11015 @itemize @minus | |
11016 @item | |
11017 ASCII characters are represented as a single Ibyte, in the range 0 - | |
11018 0x7f. | |
11019 @item | |
11020 All other characters are represented as an Ibyte in the range 0x80 - 0x9f | |
11021 followed by one or more Ibytes in the range 0xa0 to 0xff. | |
11022 @end itemize | |
11023 | |
11024 This leads to a number of desirable properties: | |
11025 | |
11026 @itemize @minus | |
11027 @item | |
11028 Given the position of the beginning of a character, you can find the | |
11029 beginning of the next or previous character in constant time. | |
11030 @item | |
11031 When searching for a substring or an ASCII character within the string, | |
11032 you need merely use standard searching routines. | |
11033 @end itemize | |
11034 | |
11035 @item Itext | |
11036 | |
11037 #### Document me. | |
11038 | |
11039 @item Ichar | |
11040 This typedef represents a single Emacs character, which can be ASCII, | |
11041 ISO-8859, or some extended character, as would typically be used for | |
11042 Kanji. Note that the representation of a character as an Ichar is @strong{not} | |
11043 the same as the representation of that same character in a string; thus, | |
11044 you cannot do the standard C trick of passing a pointer to a character | |
11045 to a function that expects a string. | |
11046 | |
11047 An Ichar takes up 19 bits of representation and (for code compatibility | |
11048 and such) is compatible with an int. This representation is visible on | |
11049 the Lisp level. The important characteristics of the Ichar | |
11050 representation are | |
11051 | |
11052 @itemize @minus | |
11053 @item | |
11054 values 0x00 - 0x7f represent ASCII. | |
11055 @item | |
11056 values 0x80 - 0xff represent the right half of ISO-8859-1. | |
11057 @item | |
11058 values 0x100 and up represent all other characters. | |
11059 @end itemize | |
11060 | |
11061 This means that Ichar values are upwardly compatible with the standard | |
11062 8-bit representation of ASCII/ISO-8859-1. | |
11063 | |
11064 @item Extbyte | |
11065 Strings that go in or out of Emacs are in "external format", typedef'ed | |
11066 as an array of char or a char *. There is more than one external format | |
11067 (JIS, EUC, etc.) but they have similar properties. Many (e.g. JIS) are modal | |
11068 encodings, which is to say that the meaning of particular bytes is not | |
11069 fixed but depends on what "mode" the string is currently in (e.g. bytes | |
11070 in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or | |
11071 as 2-byte Kanji, depending on the current mode). The mode starts out in | |
11072 ASCII/ISO-8859-1 and is switched using escape sequences -- for example, | |
11073 in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes | |
11074 in the range 0 - 0x7f are interpreted as Kanji characters. | |
11075 | |
11076 External-formatted data is generally desirable for passing data between | |
11077 programs because it is upwardly compatible with standard | |
11078 ASCII/ISO-8859-1 strings and may require less space than internal | |
11079 encodings such as the one described above. In addition, some encodings | |
11080 (e.g. JIS) keep all characters (except the ESC used to switch modes) in | |
11081 the printing ASCII range 0x20 - 0x7e, which results in a much higher | |
11082 probability that the data will avoid being garbled in transmission. | |
11083 Externally-formatted data is generally not very convenient to work with, | |
11084 however, and for this reason is usually converted to internal format | |
11085 before any work is done on the string. | |
11086 | |
11087 NOTE: filenames need to be in external format so that ISO-8859-1 | |
11088 characters come out correctly. | |
11089 @end table | |
11090 | |
11091 @node Buffer Positions, Other Typedefs, Different Ways of Seeing Internal Text, Byte/Character Types; Buffer Positions; Other Typedefs | |
11092 @subsection Buffer Positions | |
11093 @cindex buffer positions | |
11094 | |
11095 There are three possible ways to specify positions in a buffer. All | |
11096 of these are one-based: the beginning of the buffer is position or | |
11097 index 1, and 0 is not a valid position. | |
11098 | |
11099 As a "buffer position" (typedef Charbpos): | |
11100 | |
11101 This is an index specifying an offset in characters from the | |
11102 beginning of the buffer. Note that buffer positions are | |
11103 logically @strong{between} characters, not on a character. The | |
11104 difference between two buffer positions specifies the number of | |
11105 characters between those positions. Buffer positions are the | |
11106 only kind of position externally visible to the user. | |
11107 | |
11108 As a "byte index" (typedef Bytebpos): | |
11109 | |
11110 This is an index over the bytes used to represent the characters | |
11111 in the buffer. If there is no Mule support, this is identical | |
11112 to a buffer position, because each character is represented | |
11113 using one byte. However, with Mule support, many characters | |
11114 require two or more bytes for their representation, and so a | |
11115 byte index may be greater than the corresponding buffer | |
11116 position. | |
11117 | |
11118 As a "memory index" (typedef Membpos): | |
11119 | |
11120 This is the byte index adjusted for the gap. For positions | |
11121 before the gap, this is identical to the byte index. For | |
11122 positions after the gap, this is the byte index plus the gap | |
11123 size. There are two possible memory indices for the gap | |
11124 position; the memory index at the beginning of the gap should | |
11125 always be used, except in code that deals with manipulating the | |
11126 gap, where both indices may be seen. The address of the | |
11127 character "at" (i.e. following) a particular position can be | |
11128 obtained from the formula | |
11129 | |
11130 buffer_start_address + memory_index(position) - 1 | |
11131 | |
11132 except in the case of characters at the gap position. | |
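The relationship between byte indices, memory indices and addresses can be sketched as follows. The structure and function names are hypothetical simplifications and do not reflect the actual buffer structure; all positions are one-based, as described above.

```c
#include <stddef.h>

/* A hypothetical, simplified gap buffer. */
struct gap_buffer
{
  const unsigned char *start;   /* address of the first byte of storage */
  size_t gap_pos;               /* byte index of the gap position */
  size_t gap_size;              /* number of unused bytes in the gap */
};

/* Byte index -> memory index.  At the gap position itself this yields
   the memory index of the beginning of the gap. */
static size_t
byte_index_to_memory_index (const struct gap_buffer *buf, size_t byte_index)
{
  return byte_index > buf->gap_pos ? byte_index + buf->gap_size : byte_index;
}

/* Address of the byte "at" (i.e. following) byte_index.  The character
   at the gap position lives just past the gap, so that case is handled
   separately from the general formula. */
static const unsigned char *
char_address (const struct gap_buffer *buf, size_t byte_index)
{
  size_t mem = byte_index_to_memory_index (buf, byte_index);

  if (byte_index == buf->gap_pos)
    mem += buf->gap_size;
  return buf->start + mem - 1;
}
```

With storage laid out as "ab", a four-byte gap, then "cd", the gap position is byte index 3: byte indices 1 and 2 map to memory indices 1 and 2, byte index 3 maps to memory index 3 (the beginning of the gap), and byte index 4 maps to memory index 8.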
11133 | |
11134 @node Other Typedefs, Usage of the Various Representations, Buffer Positions, Byte/Character Types; Buffer Positions; Other Typedefs | |
11135 @subsection Other Typedefs | |
11136 @cindex other typedefs | |
11137 | |
11138 Charcount: | |
11139 ---------- | |
11140 This typedef represents a count of characters, such as | |
11141 a character offset into a string or the number of | |
11142 characters between two positions in a buffer. The | |
11143 difference between two Charbpos's is a Charcount, and | |
11144 character positions in a string are represented using | |
11145 a Charcount. | |
11146 | |
11147 Textcount: | |
11148 ---------- | |
11149 #### Document me. | |
11150 | |
11151 Bytecount: | |
11152 ---------- | |
11153 Similar to a Charcount but represents a count of bytes. | |
11154 The difference between two Bytebpos's is a Bytecount. | |
11155 | |
11156 | |
11157 @node Usage of the Various Representations, Working With the Various Representations, Other Typedefs, Byte/Character Types; Buffer Positions; Other Typedefs | |
11158 @subsection Usage of the Various Representations | |
11159 @cindex usage of the various representations | |
11160 | |
11161 Memory indices are used in low-level functions in insdel.c and for | |
11162 extent endpoints and marker positions. The reason for this is that | |
11163 this way, the extents and markers don't need to be updated for most | |
11164 insertions, which merely shrink the gap and don't move any | |
11165 characters around in memory. | |
11166 | |
11167 (The beginning-of-gap memory index simplifies insertions w.r.t. | |
11168 markers, because text usually gets inserted after markers. For | |
11169 extents, it is merely for consistency, because text can get | |
11170 inserted either before or after an extent's endpoint depending on | |
11171 the open/closedness of the endpoint.) | |
11172 | |
11173 Byte indices are used in other code that needs to be fast, | |
11174 such as the searching, redisplay, and extent-manipulation code. | |
11175 | |
11176 Buffer positions are used in all other code. This is because this | |
11177 representation is easiest to work with (especially since Lisp | |
11178 code always uses buffer positions), necessitates the fewest | |
11179 changes to existing code, and is the safest (e.g. if the text gets | |
11180 shifted underneath a buffer position, it will still point to a | |
11181 character; if text is shifted under a byte index, it might point | |
11182 to the middle of a character, which would be bad). | |
11183 | |
11184 Similarly, Charcounts are used in all code that deals with strings | |
11185 except for code that needs to be fast, which uses Bytecounts. | |
11186 | |
11187 Strings are always passed around internally using internal format. | |
11188 Conversions to and from external format are performed at the time | |
11189 that the data goes in or out of Emacs. | |
11190 | |
11191 @node Working With the Various Representations, , Usage of the Various Representations, Byte/Character Types; Buffer Positions; Other Typedefs | |
11192 @subsection Working With the Various Representations | |
11193 @cindex working with the various representations | |
11194 | |
11195 We write things this way because it's very important that | |
11196 MAX_BYTEBPOS_GAP_SIZE_3 is a multiple of 3. (As it happens, | |
11197 65535 is a multiple of 3, but this may not always be the | |
11198 case. #### unfinished | |
11199 | |
11200 @node Internal Text API's, Coding for Mule, Byte/Character Types; Buffer Positions; Other Typedefs, Multilingual Support | |
11201 @section Internal Text API's | |
11202 @cindex internal text API's | |
11203 @cindex text API's, internal | |
11204 @cindex API's, text, internal | |
11205 | |
11206 @strong{NOTE}: The most current documentation for these API's is in | |
11207 @file{text.h}. In case of error, assume that file is correct and this | |
11208 one wrong. | |
11209 | |
11210 @menu | |
11211 * Basic internal-format API's:: | |
11212 * The DFC API:: | |
11213 * The Eistring API:: | |
11214 @end menu | |
11215 | |
11216 @node Basic internal-format API's, The DFC API, Internal Text API's, Internal Text API's | |
11217 @subsection Basic internal-format API's | |
11218 @cindex basic internal-format API's | |
11219 @cindex internal-format API's, basic | |
11220 @cindex API's, basic internal-format | |
11221 | |
11222 These are simple functions and macros to convert between text | |
11223 representation and characters, move forward and back in text, etc. | |
11224 | |
11225 #### Finish the rest of this. | |
11226 | |
11227 Use the following functions/macros on contiguous text in any of the | |
11228 internal formats. Those that take a format arg work on all internal | |
11229 formats; the others work only on the default (variable-width under Mule) | |
11230 format. If the text you're operating on is known to come from a buffer, | |
11231 use the buffer-level functions in buffer.h, which automatically know the | |
11232 correct format and handle the gap. | |
11233 | |
11234 Some terminology: | |
11235 | |
11236 "itext" appearing in the macros means "internal-format text" -- type | |
11237 @code{Ibyte *}. Operations on such pointers themselves, rather than on the | |
11238 text being pointed to, have "itext" instead of "itext" in the macro | |
11239 name. "ichar" in the macro names means an Ichar -- the representation | |
11240 of a character as a single integer rather than a series of bytes, as part | |
11241 of "itext". Many of the macros below are for converting between the | |
11242 two representations of characters. | |
11243 | |
11244 Note also that we try to consistently distinguish between an "Ichar" and | |
11245 a Lisp character. Stuff working with Lisp characters often just says | |
11246 "char", so we consistently use "Ichar" when that's what we're working | |
11247 with. | |
11248 | |
11249 @node The DFC API, The Eistring API, Basic internal-format API's, Internal Text API's | |
11250 @subsection The DFC API | |
11251 @cindex DFC API | |
11252 @cindex API, DFC | |
11253 | |
11254 This is for conversion between internal and external text. Note that | |
11255 there is also the "new DFC" API, which @strong{returns} a pointer to the | |
11256 converted text (in alloca space), rather than storing it into a | |
11257 variable. | |
11258 | |
11259 The macros below are used for converting data between different formats. | |
11260 Generally, the data is textual, and the formats are related to | |
11261 internationalization (e.g. converting between internal-format text and | |
11262 UTF-8) -- but the mechanism is general, and could be used for anything, | |
11263 e.g. decoding gzipped data. | |
11264 | |
11265 In general, conversion involves a source of data, a sink, the existing | |
11266 format of the source data, and the desired format of the sink. The | |
11267 macros below, however, always require that either the source or sink is | |
11268 internal-format text. Therefore, in practice the conversions below | |
11269 involve source, sink, an external format (specified by a coding system), | |
11270 and the direction of conversion (internal->external or vice-versa). | |
11271 | |
11272 Sources and sinks can be raw data (sized or unsized -- when unsized, | |
11273 input data is assumed to be null-terminated [double null-terminated for | |
11274 Unicode-format data], and on output the length is not stored anywhere), | |
11275 Lisp strings, Lisp buffers, lstreams, and opaque data objects. When the | |
11276 output is raw data, the result can be allocated either with @code{alloca()} or | |
11277 @code{malloc()}. (There is currently no provision for writing into a fixed | |
11278 buffer. If you want this, use @code{alloca()} output and then copy the data -- | |
11279 but be careful with the size! Unless you are very sure of the encoding | |
11280 being used, upper bounds for the size are not in general computable.) | |
11281 The obvious restrictions on source and sink types apply (e.g. Lisp | |
11282 strings are a source and sink only for internal data). | |
11283 | |
11284 All raw output data will contain an extra null byte (two bytes for | |
11285 Unicode -- currently, in fact, all output data, whether internal or | |
11286 external, is double-null-terminated, but you can't count on this; see | |
11287 below). This means that enough space is allocated to contain the extra | |
11288 nulls; however, these nulls are not reflected in the returned output | |
11289 size. | |
11290 | |
11291 The most basic macros are TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. | |
11292 These can be used to convert between any kinds of sources or sinks. | |
11293 However, 99% of conversions involve raw data or Lisp strings as both | |
11294 source and sink, and usually data is output as @code{alloca()} rather than | |
11295 @code{malloc()}. For this reason, convenience macros are defined for many types | |
11296 of conversions involving raw data and/or Lisp strings, especially when | |
11297 the output is an @code{alloca()}ed string. (When the destination is a | |
11298 Lisp_String, there are other functions that should be used instead -- | |
11299 @code{build_ext_string()} and @code{make_ext_string()}, for example.) The convenience | |
11300 macros are of two types -- the older kind that store the result into a | |
11301 specified variable, and the newer kind that return the result. The newer | |
11302 kind of macros don't exist when the output is sized data, because that | |
11303 would have two return values. NOTE: All convenience macros are | |
11304 ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. | |
11305 Thus, any comments below about the workings of these macros also apply to | |
11306 all convenience macros. | |
11307 | |
11308 @example | |
11309 TO_EXTERNAL_FORMAT (source_type, source, sink_type, sink, codesys) | |
11310 TO_INTERNAL_FORMAT (source_type, source, sink_type, sink, codesys) | |
11311 @end example | |
11312 | |
11313 Typical use is | |
11314 | |
11315 @example | |
11316 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name); | |
11317 @end example | |
11318 | |
11319 which means that the contents of the lisp string @var{str} are written | |
11320 to a malloc'ed memory area which will be pointed to by @var{ptr}, after the | |
11321 function returns. The conversion will be done using the @code{file-name} | |
11322 coding system (which will be controlled by the user indirectly by | |
11323 setting or binding the variable @code{file-name-coding-system}). | |
11324 | |
11325 Some sources and sinks require two C variables to specify. We use | |
11326 some preprocessor magic to allow different source and sink types, and | |
11327 even different numbers of arguments to specify different types of | |
11328 sources and sinks. | |
11329 | |
11330 So we can have a call that looks like | |
11331 | |
11332 @example | |
11333 TO_INTERNAL_FORMAT (DATA, (ptr, len), | |
11334 MALLOC, (ptr, len), | |
11335 coding_system); | |
11336 @end example | |
11337 | |
11338 The parenthesized argument pairs are required to make the | |
11339 preprocessor magic work. | |
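
The general technique behind such parenthesized pairs can be sketched in a few lines of standalone C. This is an illustration of the idea only, not the actual implementation: every macro and function name below (@code{PAIR_FIRST}, @code{SOURCE_PTR}, @code{COUNT_BYTES}, and so on) is hypothetical. The type keyword is pasted into a macro name, and that macro either uses its argument directly or lets a two-argument helper macro consume the parenthesized pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical and exist only for this sketch.  */

/* Select one element of a pair.  Writing `PAIR_FIRST arg' where ARG is
   `(ptr, len)' turns into the invocation `PAIR_FIRST (ptr, len)'.  */
#define PAIR_FIRST(a, b) (a)
#define PAIR_SECOND(a, b) (b)

/* One pair of accessors per source type.  */
#define SOURCE_DATA_PTR(arg)     PAIR_FIRST arg
#define SOURCE_DATA_LEN(arg)     ((size_t) PAIR_SECOND arg)
#define SOURCE_C_STRING_PTR(arg) (arg)
#define SOURCE_C_STRING_LEN(arg) strlen (arg)

/* Dispatch on the type keyword by token pasting.  */
#define SOURCE_PTR(type, arg) SOURCE_##type##_PTR (arg)
#define SOURCE_LEN(type, arg) SOURCE_##type##_LEN (arg)

/* Count occurrences of byte C in the LEN bytes starting at PTR.  */
size_t
count_byte (const char *ptr, size_t len, char c)
{
  size_t i, n = 0;
  for (i = 0; i < len; i++)
    if (ptr[i] == c)
      n++;
  return n;
}

/* A macro whose second argument is either one value or a pair,
   depending on the type keyword given first.  */
#define COUNT_BYTES(type, arg, c) \
  count_byte (SOURCE_PTR (type, arg), SOURCE_LEN (type, arg), (c))
```

With these definitions, @code{COUNT_BYTES (DATA, (buf, 3), 'a')} and @code{COUNT_BYTES (C_STRING, buf, 'a')} both compile, even though the second argument has a different shape in each call.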
11340 | |
11341 NOTE: GC is inhibited during the entire operation of these macros. This | |
11342 is because frequently the data to be converted comes from strings but | |
11343 gets passed in as just DATA, and GC may move around the string data. If | |
11344 we didn't inhibit GC, there'd have to be a lot of messy recoding, | |
11345 alloca-copying of strings and other annoying stuff. | |
11346 | |
11347 The source or sink can be specified in one of these ways: | |
11348 | |
11349 @example | |
11350 DATA, (ptr, len), // input data is a fixed buffer of size len | |
11351 ALLOCA, (ptr, len), // output data is in an @code{ALLOCA()}ed buffer of size len | |
11352 MALLOC, (ptr, len), // output data is in a @code{malloc()}ed buffer of size len | |
11353 C_STRING_ALLOCA, ptr, // equivalent to ALLOCA (ptr, len_ignored) on output | |
11354 C_STRING_MALLOC, ptr, // equivalent to MALLOC (ptr, len_ignored) on output | |
11355 C_STRING, ptr, // equivalent to DATA, (ptr, strlen/wcslen (ptr)) | |
11356 // on input (the Unicode version is used when correct) | |
11357 LISP_STRING, string, // input or output is a Lisp_Object of type string | |
11358 LISP_BUFFER, buffer, // output is written to (point) in lisp buffer | |
11359 LISP_LSTREAM, lstream, // input or output is a Lisp_Object of type lstream | |
11360 LISP_OPAQUE, object, // input or output is a Lisp_Object of type opaque | |
11361 @end example | |
11362 | |
11363 When specifying the sink, use lvalues, since the macro will assign to them, | |
11364 except when the sink is an lstream or a lisp buffer. | |
11365 | |
11366 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the resulting text is | |
11367 stored in a stack-allocated buffer, which is automatically freed on | |
11368 returning from the function. However, the sink types @code{MALLOC} and | |
11369 @code{C_STRING_MALLOC} return @code{xmalloc()}ed memory. The caller is responsible | |
11370 for freeing this memory using @code{xfree()}. | |
11371 | |
11372 The macros accept the kinds of sources and sinks appropriate for | |
11373 internal and external data representation. See the type_checking_assert | |
11374 macros below for the actual allowed types. | |
11375 | |
11376 Since some sources and sinks use one argument (a Lisp_Object) to | |
11377 specify them, while others take a (pointer, length) pair, we use | |
11378 some C preprocessor trickery to allow pair arguments to be specified | |
11379 by parenthesizing them, as in the examples above. | |
11380 | |
11381 Anything prefixed by dfc_ (`data format conversion') is private. | |
11382 They are only used to implement these macros. | |
11383 | |
11384 [[Using C_STRING* is appropriate for using with external APIs that | |
11385 take null-terminated strings. For internal data, we should try to | |
11386 be '\0'-clean - i.e. allow arbitrary data to contain embedded '\0'. | |
11387 | |
11388 Sometime in the future we might allow output to C_STRING_ALLOCA or | |
11389 C_STRING_MALLOC _only_ with @code{TO_EXTERNAL_FORMAT()}, not | |
11390 @code{TO_INTERNAL_FORMAT()}.]] | |
11391 | |
11392 The above comments are not true. Frequently (most of the time, in | |
11393 fact), external strings come as zero-terminated entities, where the | |
11394 zero-termination is the only way to find out the length. Even in | |
11395 cases where you can get the length, most of the time the system will | |
11396 still use the null to signal the end of the string, and there will | |
11397 still be no way to either send in or receive a string with embedded | |
11398 nulls. In such situations, it's pointless to track the length | |
11399 because null bytes can never be in the string. We have a lot of | |
11400 operations that make it easy to operate on zero-terminated strings, | |
11401 and forcing the user to deal with the length everywhere would only | |
11402 make the code uglier and more complicated, for no gain. --ben | |
11403 | |
11404 There is no problem using the same lvalue for source and sink. | |
11405 | |
11406 Also, when pointers are required, the code (currently at least) is | |
11407 lax and allows any pointer types, either in the source or the sink. | |
11408 This makes it possible, e.g., to deal with internal format data held | |
11409 in char *'s or external format data held in WCHAR * (i.e. Unicode). | |
11410 | |
11411 Finally, whenever storage allocation is called for, extra space is | |
11412 allocated for a terminating zero, and such a zero is stored in the | |
11413 appropriate place, regardless of whether the source data was | |
11414 specified using a length or was specified as zero-terminated. This | |
11415 allows you to freely pass the resulting data, no matter how | |
11416 obtained, to a routine that expects zero termination (modulo, of | |
11417 course, that any embedded zeros in the resulting text will cause | |
11418 truncation). In fact, currently two terminating zeros are allocated | |
11419 and stored after the data result. This is to allow for the | |
11420 possibility of storing a Unicode value on output, which needs the | |
11421 two zeros. Currently, however, the two zeros are stored regardless | |
11422 of whether the conversion is internal or external and regardless of | |
11423 whether the external coding system is in fact Unicode. This | |
11424 behavior may change in the future, and you cannot rely on this -- | |
11425 the most you can rely on is that sink data in Unicode format will | |
11426 have two terminating nulls, which combine to form one Unicode null | |
11427 character. | |
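
The termination convention described above can be modelled with a small standalone function. This is a sketch under stated assumptions, not the code the macros actually use: @code{dup_with_double_null} is a hypothetical name, and plain @code{malloc()} stands in for whatever allocator the real sink uses. It shows the two properties that matter to callers: space for the two zero bytes is allocated and filled, but the reported length does not include them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a raw-data sink, not part of XEmacs.
   Duplicate LEN bytes of SRC into malloc()ed storage and append two
   zero bytes.  The length stored in *LEN_OUT (when LEN_OUT is not
   NULL) excludes the terminators.  Return NULL if allocation fails
   or LEN is too large.  The caller frees the result with free().  */
char *
dup_with_double_null (const char *src, size_t len, size_t *len_out)
{
  char *dst;

  assert (src != NULL);
  if (len > (size_t) -1 - 2)
    return NULL;

  dst = (char *) malloc (len + 2);
  if (dst == NULL)
    return NULL;

  memcpy (dst, src, len);
  dst[len] = '\0';      /* enough for byte-oriented encodings */
  dst[len + 1] = '\0';  /* together: one 16-bit Unicode null */

  if (len_out != NULL)
    *len_out = len;
  return dst;
}
```

Passing the result to a routine such as @code{strlen()} is always safe, but data with an embedded zero is cut short at that zero, which is the truncation caveat noted above.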
11428 | |
11429 NOTE: You might ask, why are these not written as functions that | |
11430 @strong{RETURN} the converted string, since that would allow them to be used | |
11431 much more conveniently, without having to constantly declare temporary | |
11432 variables? The answer is that in fact I originally did write the | |
11433 routines that way, but that required either | |
11434 | |
11435 @itemize @bullet | |
11436 @item | |
11437 (a) calling @code{alloca()} inside of a function call, or | |
11438 @item | |
11439 (b) using expressions separated by commas and a global temporary variable, or | |
11440 @item | |
11441 (c) using the GCC extension (@{ ... @}). | |
11442 @end itemize | |
11443 | |
11444 Turned out that all of the above had bugs, all caused by GCC (hence the | |
11445 comments about "those GCC wankers" and "ream gcc up the ass"). As for | |
11446 (a), some versions of GCC (especially on Intel platforms) had | |
11447 buggy implementations of @code{alloca()} that couldn't handle being called | |
11448 inside of a function call -- they just decremented the stack right in the | |
11449 middle of pushing args. Oops, crash with stack trashing, very bad. (b) | |
11450 was an attempt to fix (a), and that led to further GCC crashes, esp. when | |
11451 you had two such calls in a single subexpression, because GCC couldn't be | |
11452 counted upon to follow even a minimally reasonable order of execution. | |
11453 True, you can't count on one argument being evaluated before another, but | |
11454 GCC would actually interleave them so that the temp var got stomped on by | |
11455 one while the other was accessing it. So I tried (c), which was | |
11456 problematic because that GCC extension has more bugs in it than a | |
11457 termite's nest. | |
11458 | |
11459 So reluctantly I converted to the current way. Now, that was a while ago | |
11460 (c. 1994), and it appears that the bug involving alloca in function calls | |
11461 has long since been fixed. More recently, I defined the new-dfc routines | |
11462 down below, which DO allow exactly such convenience of returning your | |
11463 args rather than store them in temp variables, and I also wrote a | |
11464 configure check to see whether @code{alloca()} causes crashes inside of function | |
11465 calls, and if so use the portable @code{alloca()} implementation in alloca.c. | |
11466 If you define TEST_NEW_DFC, the old routines get written in terms of the | |
11467 new ones, and I've had a beta put out with this on, and | |
11468 this appears to cause no problems -- so we should consider | |
11469 switching, and feel no compunctions about writing further such function- | |
11470 like @code{alloca()} routines in lieu of statement-like ones. --ben | |
11471 | |
11472 @node The Eistring API, , The DFC API, Internal Text API's | |
11473 @subsection The Eistring API | |
11474 @cindex Eistring API | |
11475 @cindex API, Eistring | |
11476 | |
11477 (This API is currently under-used.) When doing simple things with | |
11478 internal text, the basic internal-format API's are enough. But to do | |
11479 things like delete or replace a substring, concatenate various strings, | |
11480 etc. is difficult to do cleanly because of the allocation issues. | |
11481 The Eistring API is designed to deal with this, and provides a clean | |
11482 way of modifying and building up internal text. (Note that the former | |
11483 lack of this API has meant that some code uses Lisp strings to do | |
11484 similar manipulations, resulting in excess garbage and increased | |
11485 garbage collection.) | |
11486 | |
11487 NOTE: The Eistring API is (or should be) Mule-correct even without | |
11488 an ASCII-compatible internal representation. | |
11489 | |
11490 @example | |
11491 #### NOTE: This is a work in progress. Neither the API nor especially | |
11492 the implementation is finished. | |
11493 | |
11494 NOTE: An Eistring is a structure that makes it easy to work with | |
11495 internally-formatted strings of data. It provides operations similar | |
11496 in feel to the standard @code{strcpy()}, @code{strcat()}, @code{strlen()}, etc., but | |
11497 | |
11498 (a) it is Mule-correct | |
11499 (b) it does dynamic allocation so you never have to worry about size | |
11500 restrictions | |
11501 (c) it comes in an @code{ALLOCA()} variety (all allocation is stack-local, | |
11502 so there is no need to explicitly clean up) as well as a @code{malloc()} | |
11503 variety | |
11504 (d) it knows its own length, so it does not suffer from standard null | |
11505 byte brain-damage -- but it null-terminates the data anyway, so | |
11506 it can be passed to standard routines | |
11507 (e) it provides a much more powerful set of operations and knows about | |
11508 all the standard places where string data might reside: Lisp_Objects, | |
11509 other Eistrings, Ibyte * data with or without an explicit length, | |
11510 ASCII strings, Ichars, etc. | |
11511 (f) it provides easy operations to convert to/from externally-formatted | |
11512 data, and is easier to use than the standard TO_INTERNAL_FORMAT | |
11513 and TO_EXTERNAL_FORMAT macros. (An Eistring can store both the internal | |
11514 and external version of its data, but the external version is only | |
11515 initialized or changed when you call @code{eito_external()}.) | |
11516 | |
11517 The idea is to make it as easy to write Mule-correct string manipulation | |
11518 code as it is to write normal string manipulation code. We also make | |
11519 the API sufficiently general that it can handle multiple internal data | |
11520 formats (e.g. some fixed-width optimizing formats and a default variable | |
11521 width format) and allows for @strong{ANY} data format we might choose in the | |
11522 future for the default format, including UCS2. (In other words, we can't | |
11523 assume that the internal format is ASCII-compatible and we can't assume | |
11524 it doesn't have embedded null bytes. We do assume, however, that any | |
11525 chosen format will have the concept of null-termination.) All of this is | |
11526 hidden from the user. | |
11527 | |
11528 #### It is really too bad that we don't have a real object-oriented | |
11529 language, or at least a language with polymorphism! | |
11530 | |
11531 | |
11532 ********************************************** | |
11533 * Declaration * | |
11534 ********************************************** | |
11535 | |
11536 To declare an Eistring, either put one of the following in the local | |
11537 variable section: | |
11538 | |
11539 DECLARE_EISTRING (name); | |
11540 Declare a new Eistring and initialize it to the empty string. This | |
11541 is a standard local variable declaration and can go anywhere in the | |
11542 variable declaration section. NAME itself is declared as an | |
11543 Eistring *, and its storage declared on the stack. | |
11544 | |
11545 DECLARE_EISTRING_MALLOC (name); | |
11546 Declare and initialize a new Eistring, which uses @code{malloc()}ed | |
11547 instead of @code{ALLOCA()}ed data. This is a standard local variable | |
11548 declaration and can go anywhere in the variable declaration | |
11549 section. Once you initialize the Eistring, you will have to free | |
11550 it using @code{eifree()} to avoid memory leaks. You will need to use this | |
11551 form if you are passing an Eistring to any function that modifies | |
11552 it (otherwise, the modified data may be in stack space and get | |
11553 overwritten when the function returns). | |
11554 | |
11555 or use | |
11556 | |
11557 Eistring ei; | |
11558 void eiinit (Eistring *ei); | |
11559 void eiinit_malloc (Eistring *einame); | |
11560 If you need to put an Eistring elsewhere than in a local variable | |
11561 declaration (e.g. in a structure), declare it as shown and then | |
11562 call one of the init macros. | |
11563 | |
11564 Also note: | |
11565 | |
11566 void eifree (Eistring *ei); | |
11567 If you declared an Eistring to use @code{malloc()} to hold its data, | |
11568 or converted it to the heap using @code{eito_malloc()}, then this | |
11569 releases any data in it and afterwards resets the Eistring | |
11570 using @code{eiinit_malloc()}. Otherwise, it just resets the Eistring | |
11571 using @code{eiinit()}. | |
11572 | |
11573 | |
11574 ********************************************** | |
11575 * Conventions * | |
11576 ********************************************** | |
11577 | |
11578 - The names of the functions have been chosen, where possible, to | |
11579 match the names of @code{str*()} functions in the standard C API. | |
11580 - | |
11581 | |
11582 | |
11583 ********************************************** | |
11584 * Initialization * | |
11585 ********************************************** | |
11586 | |
11587 void eireset (Eistring *eistr); | |
11588 Initialize the Eistring to the empty string. | |
11589 | |
11590 void eicpy_* (Eistring *eistr, ...); | |
11591 Initialize the Eistring from somewhere: | |
11592 | |
11593 void eicpy_ei (Eistring *eistr, Eistring *eistr2); | |
11594 ... from another Eistring. | |
11595 void eicpy_lstr (Eistring *eistr, Lisp_Object lisp_string); | |
11596 ... from a Lisp_Object string. | |
11597 void eicpy_ch (Eistring *eistr, Ichar ch); | |
11598 ... from an Ichar (this can be a conventional C character). | |
11599 | |
11600 void eicpy_lstr_off (Eistring *eistr, Lisp_Object lisp_string, | |
11601 Bytecount off, Charcount charoff, | |
11602 Bytecount len, Charcount charlen); | |
11603 ... from a section of a Lisp_Object string. | |
11604 void eicpy_lbuf (Eistring *eistr, Lisp_Object lisp_buf, | |
11605 Bytecount off, Charcount charoff, | |
11606 Bytecount len, Charcount charlen); | |
11607 ... from a section of a Lisp_Object buffer. | |
11608 void eicpy_raw (Eistring *eistr, const Ibyte *data, Bytecount len); | |
11609 ... from raw internal-format data in the default internal format. | |
11610 void eicpy_rawz (Eistring *eistr, const Ibyte *data); | |
11611 ... from raw internal-format data in the default internal format | |
11612 that is "null-terminated" (the meaning of this depends on the nature | |
11613 of the default internal format). | |
11614 void eicpy_raw_fmt (Eistring *eistr, const Ibyte *data, Bytecount len, | |
11615 Internal_Format intfmt, Lisp_Object object); | |
11616 ... from raw internal-format data in the specified format. | |
11617 void eicpy_rawz_fmt (Eistring *eistr, const Ibyte *data, | |
11618 Internal_Format intfmt, Lisp_Object object); | |
11619 ... from raw internal-format data in the specified format that is | |
11620 "null-terminated" (the meaning of this depends on the nature of | |
11621 the specific format). | |
11622 void eicpy_c (Eistring *eistr, const Ascbyte *c_string); | |
11623 ... from an ASCII null-terminated string. Non-ASCII characters in | |
11624 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
11625 void eicpy_c_len (Eistring *eistr, const Ascbyte *c_string, Bytecount len); | |
11626 ... from an ASCII string, with length specified. Non-ASCII characters | |
11627 in the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
11628 void eicpy_ext (Eistring *eistr, const Extbyte *extdata, | |
11629 Lisp_Object codesys); | |
11630 ... from external null-terminated data, with coding system specified. | |
11631 void eicpy_ext_len (Eistring *eistr, const Extbyte *extdata, | |
11632 Bytecount extlen, Lisp_Object codesys); | |
11633 ... from external data, with length and coding system specified. | |
11634 void eicpy_lstream (Eistring *eistr, Lisp_Object lstream); | |
11635 ... from an lstream; reads data till eof. Data must be in default | |
11636 internal format; otherwise, interpose a decoding lstream. | |
11637 | |
11638 | |
11639 ********************************************** | |
11640 * Getting the data out of the Eistring * | |
11641 ********************************************** | |
11642 | |
11643 Ibyte *eidata (Eistring *eistr); | |
11644 Return a pointer to the raw data in an Eistring. This is NOT | |
11645 a copy. | |
11646 | |
11647 Lisp_Object eimake_string (Eistring *eistr); | |
11648 Make a Lisp string out of the Eistring. | |
11649 | |
11650 Lisp_Object eimake_string_off (Eistring *eistr, | |
11651 Bytecount off, Charcount charoff, | |
11652 Bytecount len, Charcount charlen); | |
11653 Make a Lisp string out of a section of the Eistring. | |
11654 | |
11655 void eicpyout_alloca (Eistring *eistr, LVALUE: Ibyte *ptr_out, | |
11656 LVALUE: Bytecount len_out); | |
11657 Make an @code{ALLOCA()} copy of the data in the Eistring, using the | |
11658 default internal format. Due to the nature of @code{ALLOCA()}, this | |
11659 must be a macro, with all lvalues passed in as parameters. | |
11660 (More specifically, not all compilers correctly handle using | |
11661 @code{ALLOCA()} as the argument to a function call -- GCC on x86 | |
11662 didn't use to, for example.) A pointer to the @code{ALLOCA()}ed data | |
11663 is stored in PTR_OUT, and the length of the data (not including | |
11664 the terminating zero) is stored in LEN_OUT. | |
11665 | |
11666 void eicpyout_alloca_fmt (Eistring *eistr, LVALUE: Ibyte *ptr_out, | |
11667 LVALUE: Bytecount len_out, | |
11668 Internal_Format intfmt, Lisp_Object object); | |
11669 Like @code{eicpyout_alloca()}, but converts to the specified internal | |
11670 format. (No formats other than FORMAT_DEFAULT are currently | |
11671 implemented, and you get an assertion failure if you try.) | |
11672 | |
11673 Ibyte *eicpyout_malloc (Eistring *eistr, Bytecount *intlen_out); | |
11674 Make a @code{malloc()} copy of the data in the Eistring, using the | |
11675 default internal format. This is a real function. No lvalues | |
11676 passed in. Returns the new data, and stores the length (not | |
11677 including the terminating zero) using INTLEN_OUT, unless it's | |
11678 a NULL pointer. | |
11679 | |
11680 Ibyte *eicpyout_malloc_fmt (Eistring *eistr, Internal_Format intfmt, | |
11681 Bytecount *intlen_out, Lisp_Object object); | |
11682 Like @code{eicpyout_malloc()}, but converts to the specified internal | |
11683 format. (No formats other than FORMAT_DEFAULT are currently | |
11684 implemented, and you get an assertion failure if you try.) | |
11685 | |
11686 | |
11687 ********************************************** | |
11688 * Moving to the heap * | |
11689 ********************************************** | |
11690 | |
11691 void eito_malloc (Eistring *eistr); | |
11692 Move this Eistring to the heap. Its data will be stored in a | |
11693 @code{malloc()}ed block rather than the stack. Subsequent changes to | |
11694 this Eistring will @code{realloc()} the block as necessary. Use this | |
11695 when you want the Eistring to remain in scope past the end of | |
11696 this function call. You will have to manually free the data | |
11697 in the Eistring using @code{eifree()}. | |
11698 | |
11699 void eito_alloca (Eistring *eistr); | |
11700 Move this Eistring back to the stack, if it was moved to the | |
11701 heap with @code{eito_malloc()}. This will automatically free any | |
11702 heap-allocated data. | |
11703 | |
11704 | |
11705 | |
11706 ********************************************** | |
11707 * Retrieving the length * | |
11708 ********************************************** | |
11709 | |
11710 Bytecount eilen (Eistring *eistr); | |
11711 Return the length of the internal data, in bytes. See also | |
11712 @code{eiextlen()}, below. | |
11713 Charcount eicharlen (Eistring *eistr); | |
11714 Return the length of the internal data, in characters. | |
11715 | |
11716 | |
11717 ********************************************** | |
11718 * Working with positions * | |
11719 ********************************************** | |
11720 | |
11721 Bytecount eicharpos_to_bytepos (Eistring *eistr, Charcount charpos); | |
11722 Convert a char offset to a byte offset. | |
11723 Charcount eibytepos_to_charpos (Eistring *eistr, Bytecount bytepos); | |
11724 Convert a byte offset to a char offset. | |
11725 Bytecount eiincpos (Eistring *eistr, Bytecount bytepos); | |
11726 Increment the given position by one character. | |
11727 Bytecount eiincpos_n (Eistring *eistr, Bytecount bytepos, Charcount n); | |
11728 Increment the given position by N characters. | |
11729 Bytecount eidecpos (Eistring *eistr, Bytecount bytepos); | |
11730 Decrement the given position by one character. | |
11731 Bytecount eidecpos_n (Eistring *eistr, Bytecount bytepos, Charcount n); | |
11732 Decrement the given position by N characters. | |
11733 | |
11734 | |
11735 ********************************************** | |
11736 * Getting the character at a position * | |
11737 ********************************************** | |
11738 | |
11739 Ichar eigetch (Eistring *eistr, Bytecount bytepos); | |
11740 Return the character at a particular byte offset. | |
11741 Ichar eigetch_char (Eistring *eistr, Charcount charpos); | |
11742 Return the character at a particular character offset. | |
11743 | |
11744 | |
11745 ********************************************** | |
11746 * Setting the character at a position * | |
11747 ********************************************** | |
11748 | |
11749 Ichar eisetch (Eistring *eistr, Bytecount bytepos, Ichar chr); | |
11750 Set the character at a particular byte offset. | |
11751 Ichar eisetch_char (Eistring *eistr, Charcount charpos, Ichar chr); | |
11752 Set the character at a particular character offset. | |
11753 | |
11754 | |
11755 ********************************************** | |
11756 * Concatenation * | |
11757 ********************************************** | |
11758 | |
11759 void eicat_* (Eistring *eistr, ...); | |
11760 Concatenate onto the end of the Eistring, with data coming from the | |
11761 same places as above: | |
11762 | |
11763 void eicat_ei (Eistring *eistr, Eistring *eistr2); | |
11764 ... from another Eistring. | |
11765 void eicat_c (Eistring *eistr, Ascbyte *c_string); | |
11766 ... from an ASCII null-terminated string. Non-ASCII characters in | |
11767 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
11768 void eicat_raw (Eistring *eistr, const Ibyte *data, Bytecount len); | |
11769 ... from raw internal-format data in the default internal format. | |
11770 void eicat_rawz (Eistring *eistr, const Ibyte *data); | |
11771 ... from raw internal-format data in the default internal format | |
11772 that is "null-terminated" (the meaning of this depends on the nature | |
11773 of the default internal format). | |
11774 void eicat_lstr (Eistring *eistr, Lisp_Object lisp_string); | |
11775 ... from a Lisp_Object string. | |
11776 void eicat_ch (Eistring *eistr, Ichar ch); | |
11777 ... from an Ichar. | |
11778 | |
11779 (All except the first variety are convenience functions. | |
11780 In the general case, create another Eistring from the source.) | |
11781 | |
11782 | |
11783 ********************************************** | |
11784 * Replacement * | |
11785 ********************************************** | |
11786 | |
11787 void eisub_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
11788 Bytecount len, Charcount charlen, ...); | |
11789 Replace a section of the Eistring, specifically: | |
11790 | |
11791 void eisub_ei (Eistring *eistr, Bytecount off, Charcount charoff, | |
11792 Bytecount len, Charcount charlen, Eistring *eistr2); | |
11793 ... with another Eistring. | |
11794 void eisub_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
11795 Bytecount len, Charcount charlen, Ascbyte *c_string); | |
11796 ... with an ASCII null-terminated string. Non-ASCII characters in | |
11797 the string are @strong{ILLEGAL} (read @code{abort()} with error-checking defined). | |
11798 void eisub_ch (Eistring *eistr, Bytecount off, Charcount charoff, | |
11799 Bytecount len, Charcount charlen, Ichar ch); | |
11800 ... with an Ichar. | |
11801 | |
11802 void eidel (Eistring *eistr, Bytecount off, Charcount charoff, | |
11803 Bytecount len, Charcount charlen); | |
11804 Delete a section of the Eistring. | |
11805 | |
11806 | |
11807 ********************************************** | |
11808 * Converting to an external format * | |
11809 ********************************************** | |
11810 | |
11811 void eito_external (Eistring *eistr, Lisp_Object codesys); | |
11812 Convert the Eistring to an external format and store the result | |
11813 in the string. NOTE: Further changes to the Eistring will @strong{NOT} | |
11814 change the external data stored in the string. You will have to | |
11815 call @code{eito_external()} again in such a case if you want the external | |
11816 data. | |
11817 | |
11818 Extbyte *eiextdata (Eistring *eistr); | |
11819 Return a pointer to the external data stored in the Eistring as | |
11820 a result of a prior call to @code{eito_external()}. | |
11821 | |
11822 Bytecount eiextlen (Eistring *eistr); | |
11823 Return the length in bytes of the external data stored in the | |
11824 Eistring as a result of a prior call to @code{eito_external()}. | |
11825 | |
11826 | |
11827 ********************************************** | |
11828 * Searching in the Eistring for a character * | |
11829 ********************************************** | |
11830 | |
11831 Bytecount eichr (Eistring *eistr, Ichar chr); | |
11832 Charcount eichr_char (Eistring *eistr, Ichar chr); | |
11833 Bytecount eichr_off (Eistring *eistr, Ichar chr, Bytecount off, | |
11834 Charcount charoff); | |
11835 Charcount eichr_off_char (Eistring *eistr, Ichar chr, Bytecount off, | |
11836 Charcount charoff); | |
11837 Bytecount eirchr (Eistring *eistr, Ichar chr); | |
11838 Charcount eirchr_char (Eistring *eistr, Ichar chr); | |
11839 Bytecount eirchr_off (Eistring *eistr, Ichar chr, Bytecount off, | |
11840 Charcount charoff); | |
11841 Charcount eirchr_off_char (Eistring *eistr, Ichar chr, Bytecount off, | |
11842 Charcount charoff); | |
11843 | |
11844 | |
11845 ********************************************** | |
11846 * Searching in the Eistring for a string * | |
11847 ********************************************** | |
11848 | |
11849 Bytecount eistr_ei (Eistring *eistr, Eistring *eistr2); | |
11850 Charcount eistr_ei_char (Eistring *eistr, Eistring *eistr2); | |
11851 Bytecount eistr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off, | |
11852 Charcount charoff); | |
11853 Charcount eistr_ei_off_char (Eistring *eistr, Eistring *eistr2, | |
11854 Bytecount off, Charcount charoff); | |
11855 Bytecount eirstr_ei (Eistring *eistr, Eistring *eistr2); | |
11856 Charcount eirstr_ei_char (Eistring *eistr, Eistring *eistr2); | |
11857 Bytecount eirstr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off, | |
11858 Charcount charoff); | |
11859 Charcount eirstr_ei_off_char (Eistring *eistr, Eistring *eistr2, | |
11860 Bytecount off, Charcount charoff); | |
11861 | |
11862 Bytecount eistr_c (Eistring *eistr, Ascbyte *c_string); | |
11863 Charcount eistr_c_char (Eistring *eistr, Ascbyte *c_string); | |
11864 Bytecount eistr_c_off (Eistring *eistr, Ascbyte *c_string, Bytecount off, | |
11865 Charcount charoff); | |
11866 Charcount eistr_c_off_char (Eistring *eistr, Ascbyte *c_string, | |
11867 Bytecount off, Charcount charoff); | |
11868 Bytecount eirstr_c (Eistring *eistr, Ascbyte *c_string); | |
11869 Charcount eirstr_c_char (Eistring *eistr, Ascbyte *c_string); | |
11870 Bytecount eirstr_c_off (Eistring *eistr, Ascbyte *c_string, | |
11871 Bytecount off, Charcount charoff); | |
11872 Charcount eirstr_c_off_char (Eistring *eistr, Ascbyte *c_string, | |
11873 Bytecount off, Charcount charoff); | |
11874 | |
11875 | |
11876 ********************************************** | |
11877 * Comparison * | |
11878 ********************************************** | |
11879 | |
11880 int eicmp_* (Eistring *eistr, ...); | |
11881 int eicmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
11882 Bytecount len, Charcount charlen, ...); | |
11883 int eicasecmp_* (Eistring *eistr, ...); | |
11884 int eicasecmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
11885 Bytecount len, Charcount charlen, ...); | |
11886 int eicasecmp_i18n_* (Eistring *eistr, ...); | |
11887 int eicasecmp_i18n_off_* (Eistring *eistr, Bytecount off, Charcount charoff, | |
11888 Bytecount len, Charcount charlen, ...); | |
11889 | |
11890 Compare the Eistring with the other data. Return value same as | |
11891 from strcmp. The `*' is either `ei' for another Eistring (in | |
11892 which case `...' is an Eistring), or `c' for a pure-ASCII string | |
11893 (in which case `...' is a pointer to that string). For anything | |
11894 more complex, first create an Eistring out of the source. | |
11895 Comparison is either simple (`eicmp_...'), ASCII case-folding | |
11896 (`eicasecmp_...'), or multilingual case-folding | |
11897 (`eicasecmp_i18n_...'). | |
11898 | |
11899 | |
11900 More specifically, the prototypes are: | |
11901 | |
11902 int eicmp_ei (Eistring *eistr, Eistring *eistr2); | |
11903 int eicmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff, | |
11904 Bytecount len, Charcount charlen, Eistring *eistr2); | |
11905 int eicasecmp_ei (Eistring *eistr, Eistring *eistr2); | |
11906 int eicasecmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff, | |
11907 Bytecount len, Charcount charlen, Eistring *eistr2); | |
11908 int eicasecmp_i18n_ei (Eistring *eistr, Eistring *eistr2); | |
11909 int eicasecmp_i18n_off_ei (Eistring *eistr, Bytecount off, | |
11910 Charcount charoff, Bytecount len, | |
11911 Charcount charlen, Eistring *eistr2); | |
11912 | |
11913 int eicmp_c (Eistring *eistr, Ascbyte *c_string); | |
11914 int eicmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
11915 Bytecount len, Charcount charlen, Ascbyte *c_string); | |
11916 int eicasecmp_c (Eistring *eistr, Ascbyte *c_string); | |
11917 int eicasecmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
11918 Bytecount len, Charcount charlen, | |
11919 Ascbyte *c_string); | |
11920 int eicasecmp_i18n_c (Eistring *eistr, Ascbyte *c_string); | |
11921 int eicasecmp_i18n_off_c (Eistring *eistr, Bytecount off, Charcount charoff, | |
11922 Bytecount len, Charcount charlen, | |
11923 Ascbyte *c_string); | |
11924 | |
11925 | |
11926 ********************************************** | |
11927 * Case-changing the Eistring * | |
11928 ********************************************** | |
11929 | |
11930 void eilwr (Eistring *eistr); | |
11931 Convert all characters in the Eistring to lowercase. | |
11932 void eiupr (Eistring *eistr); | |
11933 Convert all characters in the Eistring to uppercase. | |
11934 @end example | |
11935 | |
11936 @node Coding for Mule, CCL, Internal Text API's, Multilingual Support | |
11937 @section Coding for Mule | |
11938 @cindex coding for Mule | |
11939 @cindex Mule, coding for | |
11940 | |
11941 Although Mule support is not compiled by default in XEmacs, many people | |
11942 are using it, and we consider it crucial that new code works correctly | |
11943 with multibyte characters. This is not hard; it is only a matter of | |
11944 following several simple coding guidelines. Even if you never | |
11945 compile with Mule, with a little practice you will find it quite easy | |
11946 to code Mule-correctly. | |
11947 | |
11948 Note that these guidelines are not necessarily tied to the current Mule | |
11949 implementation; they are also a good idea to follow on the grounds of | |
11950 code generalization for future I18N work. | |
11951 | |
11952 @menu | |
11953 * Character-Related Data Types:: | |
11954 * Working With Character and Byte Positions:: | |
11955 * Conversion to and from External Data:: | |
11956 * General Guidelines for Writing Mule-Aware Code:: | |
11957 * An Example of Mule-Aware Code:: | |
11958 * Mule-izing Code:: | |
11959 @end menu | |
11960 | |
11961 @node Character-Related Data Types, Working With Character and Byte Positions, Coding for Mule, Coding for Mule | |
11962 @subsection Character-Related Data Types | |
11963 @cindex character-related data types | |
11964 @cindex data types, character-related | |
11965 | |
11966 First, let's review the basic character-related datatypes used by | |
11967 XEmacs. Note that some of the separate @code{typedef}s are not | |
11968 mandatory, but they improve clarity of code a great deal, because one | |
11969 glance at the declaration can tell the intended use of the variable. | |
11970 | |
11971 @table @code | |
11972 @item Ichar | |
11973 @cindex Ichar | |
11974 An @code{Ichar} holds a single Emacs character. | |
11975 | |
11976 Obviously, the equality between characters and bytes is lost in the Mule | |
11977 world. Characters can be represented by one or more bytes in the | |
11978 buffer, and @code{Ichar} is a C type large enough to hold any | |
11979 character. (This currently isn't quite true for ISO 10646, which | |
11980 defines a character as a 31-bit non-negative quantity, while XEmacs | |
11981 characters are only 30 bits. This is irrelevant, unless you are | |
11982 considering using the ISO 10646 private groups to support really large | |
11983 private character sets---in particular, the Mule character set!---in | |
11984 a version of XEmacs using Unicode internally.) | |
11985 | |
11986 Without Mule support, an @code{Ichar} is equivalent to an | |
11987 @code{unsigned char}. [[This doesn't seem to be true; @file{lisp.h} | |
11988 unconditionally @samp{typedef}s @code{Ichar} to @code{int}.]] | |
11989 | |
11990 @item Ibyte | |
11991 @cindex Ibyte | |
11992 The data representing the text in a buffer or string is logically a set | |
11993 of @code{Ibyte}s. | |
11994 | |
11995 XEmacs does not work with the same character formats all the time; when | |
11996 reading characters from the outside, it decodes them to an internal | |
11997 format, and likewise encodes them when writing. @code{Ibyte} (in fact | |
11998 @code{unsigned char}) is the basic unit of the XEmacs internal buffer | |
11999 and string format. An @code{Ibyte *} is the type that points at text | |
12000 encoded in the variable-width internal encoding. | |
12001 | |
12002 One character can correspond to one or more @code{Ibyte}s. In the | |
12003 current Mule implementation, an ASCII character is represented by the | |
12004 same @code{Ibyte}, and other characters are represented by a sequence | |
12005 of two or more @code{Ibyte}s. (This will also be true of an | |
12006 implementation using UTF-8 as the internal encoding. In fact, only code | |
12007 that implements character code conversions and a very few macros used to | |
12008 implement motion by whole characters will notice the difference between | |
12009 UTF-8 and the Mule encoding.) | |
12010 | |
12011 Without Mule support, there are exactly 256 characters, implicitly | |
12012 Latin-1, and each character is represented using one @code{Ibyte}, and | |
12013 there is a one-to-one correspondence between @code{Ibyte}s and | |
12014 @code{Ichar}s. | |
12015 | |
12016 @item Charxpos | |
12017 @itemx Charbpos | |
12018 @itemx Charcount | |
12019 @cindex Charxpos | |
12020 @cindex Charbpos | |
12021 @cindex Charcount | |
12022 A @code{Charbpos} represents a character position in a buffer. A | |
12023 @code{Charcount} represents a number (count) of characters. Logically, | |
12024 subtracting two @code{Charbpos} values yields a @code{Charcount} value. | |
12025 When representing a character position in a string, we just use | |
12026 @code{Charcount} directly. The reason for having a separate typedef for | |
12027 buffer positions is that they are 1-based, whereas string positions are | |
12028 0-based and hence string counts and positions can be freely intermixed (a | |
12029 string position is equivalent to the count of characters from the | |
12030 beginning). When representing a character position that could be either | |
12031 in a buffer or string (for example, in the extent code), @code{Charxpos} | |
12032 is used. Although all of these are @code{typedef}ed to | |
12033 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make | |
12034 it clear what sort of position is being used. | |
12035 | |
12036 @code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the | |
12037 only ones that are ever visible to Lisp. | |
12038 | |
12039 @item Bytexpos | |
12040 @itemx Bytecount | |
12041 @cindex Bytebpos | |
12042 @cindex Bytecount | |
12043 A @code{Bytebpos} represents a byte position in a buffer. A | |
12044 @code{Bytecount} represents the distance between two positions, in | |
12045 bytes. Byte positions in strings use @code{Bytecount}, and for byte | |
12046 positions that can be either in a buffer or string, @code{Bytexpos} is | |
12047 used. The relationship between @code{Bytexpos}, @code{Bytebpos} and | |
12048 @code{Bytecount} is the same as the relationship between | |
12049 @code{Charxpos}, @code{Charbpos} and @code{Charcount}. | |
12050 | |
12051 @item Extbyte | |
12052 @cindex Extbyte | |
12053 When dealing with the outside world, XEmacs works with @code{Extbyte}s, | |
12054 which are equivalent to @code{char}. The distance between two | |
12055 @code{Extbyte}s is a @code{Bytecount}, since external text is a | |
12056 byte-by-byte encoding. Extbytes occur mainly at the transition point | |
12057 between internal text and external functions. XEmacs code should not, | |
12058 if it can possibly avoid it, do any actual manipulation using external | |
12059 text, since its format is completely unpredictable (it might not even be | |
12060 ASCII-compatible). | |
12061 @end table | |
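The relationship between 1-based buffer positions and 0-based counts can be sketched in plain C. The typedefs below are stand-ins written for this illustration (in XEmacs they are all @code{EMACS_INT}), the helper names are hypothetical, and the second helper assumes that the buffer text starts at position 1.

```c
#include <assert.h>

/* Stand-ins for illustration only; in XEmacs these are EMACS_INT. */
typedef long Charbpos;   /* character position in a buffer, 1-based */
typedef long Charcount;  /* count of characters; also a 0-based string position */

/* Hypothetical helper: subtracting two buffer positions yields a count. */
static Charcount
chars_between (Charbpos from, Charbpos to)
{
  assert (from >= 1 && to >= from);
  return to - from;
}

/* Hypothetical helper: the 0-based character offset that corresponds to
   a buffer position, assuming the buffer text starts at position 1. */
static Charcount
offset_of_bufpos (Charbpos pos)
{
  assert (pos >= 1);
  return pos - 1;
}
```

Because both types are plain integers, the compiler will not catch a mix-up; the separate names only document intent.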
12062 | |
12063 @node Working With Character and Byte Positions, Conversion to and from External Data, Character-Related Data Types, Coding for Mule | |
12064 @subsection Working With Character and Byte Positions | |
12065 @cindex character and byte positions, working with | |
12066 @cindex byte positions, working with character and | |
12067 @cindex positions, working with character and byte | |
12068 | |
12069 Now that we have defined the basic character-related types, we can look | |
12070 at the macros and functions designed for work with them and for | |
12071 conversion between them. Most of these macros are defined in | |
12072 @file{buffer.h}, and we don't discuss all of them here, but only the | |
12073 most important ones. Examining the existing code is the best way to | |
12074 learn about them. | |
12075 | |
12076 @table @code | |
12077 @item MAX_ICHAR_LEN | |
12078 @cindex MAX_ICHAR_LEN | |
12079 This preprocessor constant is the maximum number of buffer bytes needed to | |
12080 represent an Emacs character in the variable-width internal encoding. | |
12081 It is useful when allocating temporary strings that must hold a known number of | |
12082 characters. For instance: | |
12083 | |
12084 @example | |
12085 @group | |
12086 @{ | |
12087 Charcount cclen; | |
12088 ... | |
12089 @{ | |
12090 /* Allocate place for @var{cclen} characters. */ | |
12091 Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN); | |
12092 ... | |
12093 @end group | |
12094 @end example | |
12095 | |
12096 If you followed the previous section, you can guess that, logically, | |
12097 multiplying a @code{Charcount} value with @code{MAX_ICHAR_LEN} produces | |
12098 a @code{Bytecount} value. | |
12099 | |
12100 In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4. | |
12101 Without Mule, it is 1. In a mature Unicode-based XEmacs, it will also | |
12102 be 4 (since all Unicode characters can be encoded in UTF-8 in 4 bytes or | |
12103 less), but some versions may use up to 6, in order to use the large | |
12104 private space provided by ISO 10646 to ``mirror'' the Mule code space. | |
12105 | |
12106 @item itext_ichar | |
12107 @itemx set_itext_ichar | |
12108 @cindex itext_ichar | |
12109 @cindex set_itext_ichar | |
12110 The @code{itext_ichar} macro takes an @code{Ibyte} pointer and | |
12111 returns the @code{Ichar} stored at that position. If it were a | |
12112 function, its prototype would be: | |
12113 | |
12114 @example | |
12115 Ichar itext_ichar (Ibyte *p); | |
12116 @end example | |
12117 | |
12118 @code{set_itext_ichar} stores an @code{Ichar} to the specified byte | |
12119 position. It returns the number of bytes stored: | |
12120 | |
12121 @example | |
12122 Bytecount set_itext_ichar (Ibyte *p, Ichar c); | |
12123 @end example | |
12124 | |
12125 It is important to note that @code{set_itext_ichar} is safe only for | |
12126 appending a character at the end of a buffer, not for overwriting a | |
12127 character in the middle. This is because the width of characters | |
12128 varies, and @code{set_itext_ichar} cannot resize the string if it | |
12129 writes, say, a two-byte character where a single-byte character used to | |
12130 reside. | |
12131 | |
12132 A typical use of @code{set_itext_ichar} can be demonstrated by this | |
12133 example, which copies characters from buffer @var{buf} to a temporary | |
12134 string of Ibytes. | |
12135 | |
12136 @example | |
12137 @group | |
12138 @{ | |
12139 Charbpos pos; | |
12140 for (pos = beg; pos < end; pos++) | |
12141 @{ | |
12142 Ichar c = BUF_FETCH_CHAR (buf, pos); | |
12143 p += set_itext_ichar (p, c); | |
12144 @} | |
12145 @} | |
12146 @end group | |
12147 @end example | |
12148 | |
12149 Note how @code{set_itext_ichar} is used to store the @code{Ichar} | |
12150 and advance the pointer, at the same time. | |
12151 | |
12152 @item INC_IBYTEPTR | |
12153 @itemx DEC_IBYTEPTR | |
12154 @cindex INC_IBYTEPTR | |
12155 @cindex DEC_IBYTEPTR | |
12156 These two macros increment and decrement an @code{Ibyte} pointer, | |
12157 respectively. They will adjust the pointer by the appropriate number of | |
12158 bytes according to the byte length of the character stored there. Both | |
12159 macros assume that the memory address is located at the beginning of a | |
12160 valid character. | |
12161 | |
12162 Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)} | |
12163 simply expand to @code{p++} and @code{p--}, respectively. | |
12164 | |
12165 @item bytecount_to_charcount | |
12166 @cindex bytecount_to_charcount | |
12167 Given a pointer to a text string and a length in bytes, return the | |
12168 equivalent length in characters. | |
12169 | |
12170 @example | |
12171 Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc); | |
12172 @end example | |
12173 | |
12174 @item charcount_to_bytecount | |
12175 @cindex charcount_to_bytecount | |
12176 Given a pointer to a text string and a length in characters, return the | |
12177 equivalent length in bytes. | |
12178 | |
12179 @example | |
12180 Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc); | |
12181 @end example | |
12182 | |
12183 @item itext_n_addr | |
12184 @cindex itext_n_addr | |
12185 Return a pointer to the beginning of the character offset @var{cc} (in | |
12186 characters) from @var{p}. | |
12187 | |
12188 @example | |
12189 Ibyte *itext_n_addr (Ibyte *p, Charcount cc); | |
12190 @end example | |
12191 @end table | |
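The pointer-motion and count-conversion operations in the table above can be sketched outside XEmacs. This is not the XEmacs implementation: it uses UTF-8 as a stand-in for the variable-width internal encoding, and every function name is hypothetical. It assumes well-formed text and pointers that sit on character boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the UTF-8 character whose first byte is LEAD. */
static size_t
utf8_char_len (unsigned char lead)
{
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  assert ((lead & 0xF8) == 0xF0);
  return 4;
}

/* Analogue of INC_IBYTEPTR: advance by one whole character. */
static const unsigned char *
inc_ptr (const unsigned char *p)
{
  return p + utf8_char_len (*p);
}

/* Analogue of DEC_IBYTEPTR: step back over continuation bytes
   (10xxxxxx) until the first byte of the previous character. */
static const unsigned char *
dec_ptr (const unsigned char *p)
{
  do
    p--;
  while ((*p & 0xC0) == 0x80);
  return p;
}

/* Analogue of bytecount_to_charcount. */
static long
bytes_to_chars (const unsigned char *p, long bc)
{
  const unsigned char *end = p + bc;
  long cc = 0;
  while (p < end)
    {
      p = inc_ptr (p);
      cc++;
    }
  return cc;
}

/* Analogue of charcount_to_bytecount. */
static long
chars_to_bytes (const unsigned char *p, long cc)
{
  const unsigned char *start = p;
  while (cc-- > 0)
    p = inc_ptr (p);
  return (long) (p - start);
}
```

Both conversions walk the text one character at a time, so their cost grows with the length converted.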
12192 | |
12193 @node Conversion to and from External Data, General Guidelines for Writing Mule-Aware Code, Working With Character and Byte Positions, Coding for Mule | |
12194 @subsection Conversion to and from External Data | |
12195 @cindex conversion to and from external data | |
12196 @cindex external data, conversion to and from | |
12197 | |
12198 When an external function, such as a C library function, returns a | |
12199 @code{char} pointer, you should almost never treat it as @code{Ibyte}. | |
12200 This is because these returned strings may contain 8-bit characters which | |
12201 can be misinterpreted by XEmacs, and cause a crash. Likewise, when | |
12202 exporting a piece of internal text to the outside world, you should | |
12203 always convert it to an appropriate external encoding, lest the internal | |
12204 stuff (such as the infamous \201 characters) leak out. | |
12205 | |
12206 The interface for conversion between the internal and external | |
12207 representations of text consists of the numerous conversion macros defined in | |
12208 @file{buffer.h}. There used to be a fixed set of external formats | |
12209 supported by these macros, but now any coding system can be used with | |
12210 them. The coding system alias mechanism is used to create the | |
12211 following logical coding systems, which replace the fixed external | |
12212 formats. The @code{dontusethis-set-symbol-value-handler} mechanism was | |
12213 enhanced to make this possible (more work on that is needed). | |
12214 | |
12215 Often useful coding systems: | |
12216 | |
12217 @table @code | |
12218 @item Qbinary | |
12219 This is the simplest format and is what we use in the absence of a more | |
12220 appropriate format. This converts according to the @code{binary} coding | |
12221 system: | |
12222 | |
12223 @enumerate a | |
12224 @item | |
12225 On input, bytes 0--255 are converted into (implicitly Latin-1) | |
12226 characters 0--255. A non-Mule XEmacs doesn't really know about | |
12227 different character sets and the fonts to display them, so the bytes can | |
12228 be treated as text in different 1-byte encodings by simply setting the | |
12229 appropriate fonts. So in a sense, non-Mule XEmacs is a multi-lingual | |
12230 editor if, for example, different fonts are used to display text in | |
12231 different buffers, faces, or windows. The specifier mechanism gives the | |
12232 user complete control over this kind of behavior. | |
12233 @item | |
12234 On output, characters 0--255 are converted into bytes 0--255 and other | |
12235 characters are converted into @samp{~}. | |
12236 @end enumerate | |
12237 | |
12238 @item Qnative | |
12239 Format used for the external Unix environment---@code{argv[]}, stuff | |
12240 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc. | |
12241 This is encoded according to the encoding specified by the current locale. | |
12242 [[This is dangerous; current locale is user preference, and the system | |
12243 is probably going to be something else. Is there anything we can do | |
12244 about it?]] | |
12245 | |
12246 @item Qfile_name | |
12247 Format used for filenames. This is normally the same as @code{Qnative}, | |
12248 but the two should be distinguished for clarity and possible future | |
12249 separation -- and also because @code{Qfile_name} can be changed using either | |
12250 the @code{file-name-coding-system} or @code{pathname-coding-system} (now | |
12251 obsolete) variables. | |
12252 | |
12253 @item Qctext | |
12254 Compound-text format. This is the standard X11 format used for data | |
12255 stored in properties, selections, and the like. This is an 8-bit | |
12256 no-lock-shift ISO2022 coding system. This is a real coding system, | |
12257 unlike @code{Qfile_name}, which is user-definable. | |
12258 | |
12259 @item Qmswindows_tstr | |
12260 Used for external data in all MS Windows functions that are declared to | |
12261 accept data of type @code{LPTSTR} or @code{LPCTSTR}. This maps to either | |
12262 @code{Qmswindows_multibyte} (a locale-specific encoding, same as | |
12263 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether | |
12264 XEmacs is being run under Windows 9X or Windows NT/2000/XP. | |
12265 @end table | |
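The input and output rules given for @code{Qbinary} can be sketched as two small functions. These are hypothetical illustrations, not the XEmacs coding-system code; characters are modeled as plain non-negative integers, and the value above 255 used in any test is an arbitrary stand-in for a character that has no byte equivalent.

```c
#include <assert.h>
#include <stddef.h>

/* Input direction: each byte 0-255 becomes the character with the
   same number. */
static void
binary_decode (const unsigned char *in, size_t n, int *out)
{
  size_t i;
  for (i = 0; i < n; i++)
    out[i] = in[i];
}

/* Output direction: characters 0-255 become the byte with the same
   number; any other character becomes '~'. */
static void
binary_encode (const int *in, size_t n, unsigned char *out)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      assert (in[i] >= 0);
      out[i] = (unsigned char) (in[i] <= 255 ? in[i] : '~');
    }
}
```

Decoding followed by encoding returns the original bytes, which is why this format is a safe default when nothing more specific is known.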
12266 | |
12267 Many other coding systems are provided by default. | |
12268 | |
12269 There are two fundamental macros to convert between external and | |
12270 internal format, as well as various convenience macros to simplify the | |
12271 most common operations. | |
12272 | |
12273 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and | |
12274 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments | |
12275 each of these receives are a source type, a source, a sink type, a sink, | |
12276 and a coding system (or a symbol naming a coding system). | |
12277 | |
12278 A typical call looks like | |
12279 @example | |
12280 TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name); | |
12281 @end example | |
12282 | |
12283 which means that the contents of the lisp string @code{str} are written | |
12284 to a malloc'ed memory area which will be pointed to by @code{ptr}, after | |
12285 the function returns. The conversion will be done using the | |
12286 @code{file-name} coding system, which will be controlled by the user | |
12287 indirectly by setting or binding the variable | |
12288 @code{file-name-coding-system}. | |
12289 | |
12290 Some sources and sinks require two C variables to specify. We use some | |
12291 preprocessor magic to allow different source and sink types, and even | |
12292 different numbers of arguments to specify different types of sources and | |
12293 sinks. | |
12294 | |
12295 So we can have a call that looks like | |
12296 @example | |
12297 TO_INTERNAL_FORMAT (DATA, (ptr, len), | |
12298 MALLOC, (ptr, len), | |
12299 coding_system); | |
12300 @end example | |
12301 | |
12302 The parenthesized argument pairs are required to make the preprocessor | |
12303 magic work. | |
12304 | |
12305 Here are the different source and sink types: | |
12306 | |
12307 @table @code | |
12308 @item @code{DATA, (ptr, len),} | |
12309 input data is a fixed buffer of size @var{len} at address @var{ptr} | |
12310 @item @code{ALLOCA, (ptr, len),} | |
12311 output data is placed in an @code{alloca()}ed buffer of size @var{len} pointed to by @var{ptr} | |
12312 @item @code{MALLOC, (ptr, len),} | |
12313 output data is in a @code{malloc()}ed buffer of size @var{len} pointed to by @var{ptr} | |
12314 @item @code{C_STRING_ALLOCA, ptr,} | |
12315 equivalent to @code{ALLOCA, (ptr, len_ignored)} on output | |
12316 @item @code{C_STRING_MALLOC, ptr,} | |
12317 equivalent to @code{MALLOC, (ptr, len_ignored)} on output | |
12318 @item @code{C_STRING, ptr,} | |
12319 equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input | |
12320 @item @code{LISP_STRING, string,} | |
12321 input or output is a Lisp_Object of type string | |
12322 @item @code{LISP_BUFFER, buffer,} | |
12323 output is written to @code{(point)} in lisp buffer @var{buffer} | |
12324 @item @code{LISP_LSTREAM, lstream,} | |
12325 input or output is a Lisp_Object of type lstream | |
12326 @item @code{LISP_OPAQUE, object,} | |
12327 input or output is a Lisp_Object of type opaque | |
12328 @end table | |
12329 | |
12330 A source type of @code{C_STRING} or a sink type of | |
12331 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where | |
12332 the external API is not '\0'-byte-clean -- i.e. it expects strings to be | |
12333 terminated with a null byte. For external API's that are in fact | |
12334 '\0'-byte-clean, we should of course not use these. | |
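A short stand-alone sketch shows why the distinction matters. The function names are hypothetical and nothing here calls the XEmacs macros; it only contrasts the length a null-terminated view reports with the true length of sized data that contains an embedded null byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length as a C_STRING-style source would see it: the text ends at
   the first null byte. */
static size_t
c_string_length (const char *ptr)
{
  assert (ptr != NULL);
  return strlen (ptr);
}

/* Number of null bytes inside sized data, i.e. a DATA-style
   (ptr, len) pair, where null bytes are ordinary content. */
static size_t
count_nul_bytes (const char *ptr, size_t len)
{
  size_t i, n = 0;
  assert (ptr != NULL);
  for (i = 0; i < len; i++)
    if (ptr[i] == '\0')
      n++;
  return n;
}
```

With the six bytes @code{'a' 'b' '\0' 'c' 'd' '\0'}, the null-terminated view sees only two bytes, so anything after the first null byte would be lost if such data were passed as a @code{C_STRING}.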
12335 | |
12336 The sinks to be specified must be lvalues, unless they are the lisp | |
12337 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}. | |
12338 | |
12339 There is no problem using the same lvalue for source and sink. | |
12340 | |
12341 Garbage collection is inhibited during these conversion operations, so | |
12342 it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}. | |
12343 | |
12344 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the | |
12345 resulting text is stored in a stack-allocated buffer, which is | |
12346 automatically freed on returning from the function. However, the sink | |
12347 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed | |
12348 memory. The caller is responsible for freeing this memory using | |
12349 @code{xfree()}. | |
12350 | |
12351 Note that it doesn't make sense for @code{LISP_STRING} to be a source | |
12352 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}. | |
12353 You'll get an assertion failure if you try. | |
12354 | |
12355 99% of conversions involve raw data or Lisp strings as both source and | |
12356 sink, and usually data is output to @code{alloca()}ed, or sometimes | |
12357 @code{xmalloc()}ed, memory. For this reason, convenience macros are defined for | |
12358 many types of conversions involving raw data and/or Lisp strings, | |
12359 especially when the output is an @code{alloca()}ed string. (When the | |
12360 destination is a Lisp string, there are other functions that should be | |
12361 used instead -- @code{build_ext_string()} and @code{make_ext_string()}, | |
12362 for example.) The convenience macros are of two types -- the older kind | |
12363 that store the result into a specified variable, and the newer kind that | |
12364 return the result. The newer kind of macros don't exist when the output | |
12365 is sized data, because that would have two return values. NOTE: All | |
12366 convenience macros are ultimately defined in terms of | |
12367 @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}. Thus, any | |
12368 comments above about the workings of these macros also apply to all | |
12369 convenience macros. | |
12370 | |
12371 A typical old-style convenience macro is | |
12372 | |
12373 @example | |
12374 C_STRING_TO_EXTERNAL (in, out, codesys); | |
12375 @end example | |
12376 | |
12377 This is equivalent to | |
12378 | |
12379 @example | |
12380 TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys); | |
12381 @end example | |
12382 | |
12383 but is easier to write and somewhat clearer, since it clearly identifies | |
12384 the arguments without the clutter of having the preprocessor types mixed | |
12385 in. | |
12386 | |
12387 The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src, | |
12388 codesys)}, which @emph{returns} the converted data (still in | |
12389 @code{alloca()} space). This is far more convenient for most | |
12390 operations. | |
12391 | |
12392 @node General Guidelines for Writing Mule-Aware Code, An Example of Mule-Aware Code, Conversion to and from External Data, Coding for Mule | |
12393 @subsection General Guidelines for Writing Mule-Aware Code | |
12394 @cindex writing Mule-aware code, general guidelines for | |
12395 @cindex Mule-aware code, general guidelines for writing | |
12396 @cindex code, general guidelines for writing Mule-aware | |
12397 | |
12398 This section contains some general guidance on how to write Mule-aware | |
12399 code, as well as some pitfalls you should avoid. | |
12400 | |
12401 @table @emph | |
12402 @item Never use @code{char} and @code{char *}. | |
12403 In XEmacs, the use of @code{char} and @code{char *} is almost always a | |
12404 mistake. If you want to manipulate an Emacs character from ``C'', use | |
12405 @code{Ichar}. If you want to examine a specific octet in the internal | |
12406 format, use @code{Ibyte}. If you want a Lisp-visible character, use a | |
12407 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move | |
12408 through the internal text, use @code{Ibyte *}. Also note that you | |
12409 almost certainly do not need @code{Ichar *}. Other typedefs to clarify | |
12410 the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary}, | |
12411 @code{UChar_Binary}, and @code{CIbyte}. | |
12412 | |
12413 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}. | |
12414 The whole point of using different types is to avoid confusion about the | |
12415 use of certain variables. Lest this effect be nullified, you need to be | |
12416 careful about using the right types. | |
12417 | |
12418 @item Always convert external data | |
12419 It is extremely important to always convert external data, because | |
12420 XEmacs can crash if unexpected 8-bit sequences are copied to its internal | |
12421 buffers literally. | |
12422 | |
12423 This means that when a system function, such as @code{readdir}, returns | |
12424 a string, you normally need to convert it using one of the conversion macros | |
12425 described in the previous section, before passing it further to Lisp. | |
12426 | |
12427 Actually, most of the basic system functions that accept '\0'-terminated | |
12428 string arguments, like @code{stat()} and @code{open()}, have | |
12429 @strong{encapsulated} equivalents that do the internal to external | |
12430 conversion themselves. The encapsulated equivalents have a @code{qxe_} | |
12431 prefix and have string arguments of type @code{Ibyte *}, and you can | |
12432 pass internally encoded data to them, often from a Lisp string using | |
12433 @code{XSTRING_DATA}. (A better design might be to provide versions that | |
12434 accept Lisp strings directly.) [[Really? Then they'd either take | |
12435 @code{Lisp_Object}s and need to check type, or they'd take | |
12436 @code{Lisp_String}s, and violate the rules about passing any of the | |
12437 specific Lisp types.]] | |
12438 | |
12439 Also note that many internal functions, such as @code{make_string}, | |
12440 accept Ibytes, which removes the need for them to convert the data they | |
12441 receive. This increases efficiency because that way external data needs | |
12442 to be decoded only once, when it is read. After that, it is passed | |
12443 around in internal format. | |
12444 | |
12445 @item Do all work in internal format | |
12446 External-formatted data is completely unpredictable in its format. It | |
12447 may be fixed-width Unicode (not even ASCII compatible); it may be a | |
12448 modal encoding, in | |
12449 which case some occurrences of (e.g.) the slash character may be part of | |
12450 two-byte Asian-language characters, and a naive attempt to split apart a | |
12451 pathname by slashes will fail; etc. Internal-format text should be | |
12452 converted to external format only at the point where an external API is | |
12453 actually called, and the first thing done after receiving | |
12454 external-format text from an external API should be to convert it to | |
12455 internal text. | |
12456 @end table | |
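The pitfall with modal encodings can be made concrete with a small sketch. It assumes ISO-2022-JP-style external text, in which @code{ESC $ B} switches to a two-byte character set and @code{ESC ( B} switches back to ASCII; only those two escape sequences are handled, and both function names are hypothetical. In two-byte mode a byte with value 0x2F is half of a character, not a slash.

```c
#include <assert.h>
#include <stddef.h>

/* Naive scan: treats every 0x2F byte as a path separator. */
static int
count_slashes_naive (const unsigned char *s, size_t n)
{
  size_t i;
  int count = 0;
  for (i = 0; i < n; i++)
    if (s[i] == '/')
      count++;
  return count;
}

/* Mode-aware scan: tracks the current mode and skips two-byte
   characters whole, so only real ASCII slashes are counted. */
static int
count_slashes_modal (const unsigned char *s, size_t n)
{
  size_t i = 0;
  int count = 0;
  int two_byte = 0;
  while (i < n)
    {
      if (s[i] == 0x1B && i + 2 < n && s[i + 1] == '$' && s[i + 2] == 'B')
        {
          two_byte = 1;
          i += 3;
        }
      else if (s[i] == 0x1B && i + 2 < n && s[i + 1] == '(' && s[i + 2] == 'B')
        {
          two_byte = 0;
          i += 3;
        }
      else if (two_byte)
        {
          assert (i + 1 < n);   /* a two-byte character must be complete */
          i += 2;
        }
      else
        {
          if (s[i] == '/')
            count++;
          i++;
        }
    }
  return count;
}
```

Code that works on internal-format text never needs this kind of state tracking, which is the practical argument for converting first and splitting afterwards.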
12457 | |
12458 @node An Example of Mule-Aware Code, Mule-izing Code, General Guidelines for Writing Mule-Aware Code, Coding for Mule | |
12459 @subsection An Example of Mule-Aware Code | |
12460 @cindex code, an example of Mule-aware | |
12461 @cindex Mule-aware code, an example of | |
12462 | |
12463 As an example of Mule-aware code, we will analyze the @code{string} | |
12464 function, which conses up a Lisp string from the character arguments it | |
12465 receives. Here is the definition, pasted from @file{alloc.c}: | |
12466 | |
12467 @example | |
12468 @group | |
12469 DEFUN ("string", Fstring, 0, MANY, 0, /* | |
12470 Concatenate all the argument characters and make the result a string. | |
12471 */ | |
12472 (int nargs, Lisp_Object *args)) | |
12473 @{ | |
12474 Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN); | |
12475 Ibyte *p = storage; | |
12476 | |
12477 for (; nargs; nargs--, args++) | |
12478 @{ | |
12479 Lisp_Object lisp_char = *args; | |
12480 CHECK_CHAR_COERCE_INT (lisp_char); | |
12481 p += set_itext_ichar (p, XCHAR (lisp_char)); | |
12482 @} | |
12483 return make_string (storage, p - storage); | |
12484 @} | |
12485 @end group | |
12486 @end example | |
12487 | |
12488 Now we can analyze the source line by line. | |
12489 | |
12490 Obviously, the string will contain as many characters as there are arguments to the | |
12491 function. This is why we allocate @code{MAX_ICHAR_LEN} * @var{nargs} | |
12492 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs} | |
12493 @code{Ichar}s to fit in the string. | |
12494 | |
12495 Then, the loop checks that each element is a character, converting | |
12496 integers in the process. Like many other functions in XEmacs, this | |
12497 function silently accepts integers where characters are expected, for | |
12498 historical and compatibility reasons. In new code that does not need | |
12499 this leniency, @code{CHECK_CHAR} will suffice. @code{XCHAR (lisp_char)} | |
12500 extracts the @code{Ichar} from the @code{Lisp_Object}, and | |
12501 @code{set_itext_ichar} stores it to storage, increasing @code{p} in | |
12502 the process. | |
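The same pattern (allocate for the worst case, append each character, and take the final length from the pointer difference) can be tried outside XEmacs. The sketch below is an analogue, not the real code: it uses UTF-8 in place of the internal encoding, @code{malloc} in place of @code{alloca_array}, and the names @code{MAX_CHAR_LEN}, @code{store_char} and @code{concat_chars} are hypothetical. Surrogate code points are not rejected.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CHAR_LEN 4   /* stand-in for MAX_ICHAR_LEN, using UTF-8 */

/* Stand-in for set_itext_ichar: store code point C at P as UTF-8 and
   return the number of bytes stored. */
static size_t
store_char (unsigned char *p, unsigned long c)
{
  assert (c <= 0x10FFFF);
  if (c < 0x80)
    {
      p[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (unsigned char) (0xC0 | (c >> 6));
      p[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (unsigned char) (0xE0 | (c >> 12));
      p[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (unsigned char) (0xF0 | (c >> 18));
  p[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Concatenate NCHARS characters into a freshly malloc'ed buffer.
   The byte length is stored in *LEN_OUT; the caller frees the result. */
static unsigned char *
concat_chars (const unsigned long *chars, size_t nchars, size_t *len_out)
{
  /* Worst case: every character needs MAX_CHAR_LEN bytes. */
  unsigned char *storage = malloc (nchars * MAX_CHAR_LEN + 1);
  unsigned char *p = storage;
  size_t i;

  if (storage == NULL)
    return NULL;
  for (i = 0; i < nchars; i++)
    p += store_char (p, chars[i]);   /* append and advance in one step */
  *len_out = (size_t) (p - storage);
  return storage;
}
```

The buffer is usually larger than the text it ends up holding; only the pointer difference says how many bytes were actually written.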
12503 | |
12504 Other instructive examples of correct coding under Mule can be found all | |
12505 over the XEmacs code. For starters, I recommend | |
12506 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have | |
12507 understood this section of the manual and studied the examples, you can | |
12508 proceed to write new Mule-aware code. | |
12509 | |
12510 @node Mule-izing Code, , An Example of Mule-Aware Code, Coding for Mule | |
12511 @subsection Mule-izing Code | |
12512 | |
12513 A lot of code is written without Mule in mind, and needs to be made | |
12514 Mule-correct or ``Mule-ized''. There is really no substitute for | |
12515 line-by-line analysis when doing this, but the following checklist can | |
12516 help: | |
12517 | |
12518 @itemize @bullet | |
12519 @item | |
12520 Check all uses of @code{XSTRING_DATA}. | |
12521 @item | |
12522 Check all uses of @code{build_string} and @code{make_string}. | |
12523 @item | |
12524 Check all uses of @code{tolower} and @code{toupper}. | |
12525 @item | |
12526 Check object print methods. | |
12527 @item | |
12528 Check for use of functions such as @code{write_c_string}, | |
12529 @code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}. | |
12530 @item | |
12531 Check all occurrences of @code{char} and correct to one of the other | |
12532 typedefs described above. | |
12533 @item | |
12534 Check all existing uses of @code{TO_EXTERNAL_FORMAT}, | |
12535 @code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for | |
12536 @samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}). | |
12537 @item | |
12538 In Windows code, string literals may need to be encapsulated with @code{XETEXT}. | |
12539 @end itemize | |
12540 | |
12541 @node CCL, Modules for Internationalization, Coding for Mule, Multilingual Support | |
12542 @section CCL | |
12543 @cindex CCL | |
12544 | |
12545 @example | |
12546 MACHINE CODE: | |
12547 | |
12548 The machine code consists of a vector of 32-bit words. | |
12549 The first such word specifies the start of the EOF section of the code; | |
12550 this is the code executed to handle any stuff that needs to be done | |
12551 (e.g. designating back to ASCII and left-to-right mode) after all | |
12552 other encoded/decoded data has been written out. This is not used for | |
12553 charset CCL programs. | |
12554 | |
12555 REGISTER: 0..7 -- referred to by RRR or rrr | |
12556 | |
12557 OPERATOR BIT FIELD (27-bit): XXXXXXXXXXXXXXX RRR TTTTT | |
12558 TTTTT (5-bit): operator type | |
12559 RRR (3-bit): register number | |
12560 XXXXXXXXXXXXXXX (15-bit): | |
12561 CCCCCCCCCCCCCCC: constant or address | |
12562 000000000000rrr: register number | |
12563 | |
12564 AAAAA: 00000 + | |
12565 00001 - | |
12566 00010 * | |
12567 00011 / | |
12568 00100 % | |
12569 00101 & | |
12570 00110 | | |
12571 00111 ~ | |
12572 | |
12573 01000 << | |
12574 01001 >> | |
12575 01010 <8 | |
12576 01011 >8 | |
12577 01100 // | |
12578 01101 not used | |
12579 01110 not used | |
12580 01111 not used | |
12581 | |
12582 10000 < | |
12583 10001 > | |
12584 10010 == | |
12585 10011 <= | |
12586 10100 >= | |
12587 10101 != | |
12588 | |
12589 OPERATORS: TTTTT RRR XX.. | |
12590 | |
12591 SetCS: 00000 RRR C...C RRR = C...C | |
12592 SetCL: 00001 RRR ..... RRR = c...c | |
12593 c.............c | |
12594 SetR: 00010 RRR ..rrr RRR = rrr | |
12595 SetA: 00011 RRR ..rrr RRR = array[rrr] | |
12596 C.............C size of array = C...C | |
12597 c.............c contents = c...c | |
12598 | |
12599 Jump: 00100 000 c...c jump to c...c | |
12600 JumpCond: 00101 RRR c...c if (!RRR) jump to c...c | |
12601 WriteJump: 00110 RRR c...c Write1 RRR, jump to c...c | |
12602 WriteReadJump: 00111 RRR c...c Write1, Read1 RRR, jump to c...c | |
12603 WriteCJump: 01000 000 c...c Write1 C...C, jump to c...c | |
12604 C...C | |
12605 WriteCReadJump: 01001 RRR c...c Write1 C...C, Read1 RRR, | |
12606 C.............C and jump to c...c | |
12607 WriteSJump: 01010 000 c...c WriteS, jump to c...c | |
12608 C.............C | |
12609 S.............S | |
12610 ... | |
12611 WriteSReadJump: 01011 RRR c...c WriteS, Read1 RRR, jump to c...c | |
12612 C.............C | |
12613 S.............S | |
12614 ... | |
12615 WriteAReadJump: 01100 RRR c...c WriteA, Read1 RRR, jump to c...c | |
12616 C.............C size of array = C...C | |
12617 c.............c contents = c...c | |
12618 ... | |
12619 Branch: 01101 RRR C...C if (RRR >= 0 && RRR < C..) | |
12620 c.............c branch to (RRR+1)th address | |
12621 Read1: 01110 RRR ... read 1-byte to RRR | |
12622 Read2: 01111 RRR ..rrr read 2-byte to RRR and rrr | |
12623 ReadBranch: 10000 RRR C...C Read1 and Branch | |
12624 c.............c | |
12625 ... | |
12626 Write1: 10001 RRR ..... write 1-byte RRR | |
12627 Write2: 10010 RRR ..rrr write 2-byte RRR and rrr | |
12628 WriteC: 10011 000 ..... write 1-char C...CC | |
12629 C.............C | |
12630 WriteS: 10100 000 ..... write C..-byte of string | |
12631 C.............C | |
12632 S.............S | |
12633 ... | |
12634 WriteA: 10101 RRR ..... write array[RRR] | |
12635 C.............C size of array = C...C | |
12636 c.............c contents = c...c | |
12637 ... | |
12638 End: 10110 000 ..... terminate the execution | |
12639 | |
12640 SetSelfCS: 10111 RRR C...C RRR AAAAA= C...C | |
12641 ..........AAAAA | |
12642 SetSelfCL: 11000 RRR ..... RRR AAAAA= c...c | |
12643 c.............c | |
12644 ..........AAAAA | |
12645 SetSelfR: 11001 RRR ..Rrr RRR AAAAA= rrr | |
12646 ..........AAAAA | |
12647 SetExprCL: 11010 RRR ..Rrr RRR = rrr AAAAA c...c | |
12648 c.............c | |
12649 ..........AAAAA | |
12650 SetExprR: 11011 RRR ..rrr RRR = rrr AAAAA Rrr | |
12651 ............Rrr | |
12652 ..........AAAAA | |
12653 JumpCondC: 11100 RRR c...c if !(RRR AAAAA C..) jump to c...c | |
12654 C.............C | |
12655 ..........AAAAA | |
12656 JumpCondR: 11101 RRR c...c if !(RRR AAAAA rrr) jump to c...c | |
12657 ............rrr | |
12658 ..........AAAAA | |
12659 ReadJumpCondC: 11110 RRR c...c Read1 and JumpCondC | |
12660 C.............C | |
12661 ..........AAAAA | |
12662 ReadJumpCondR: 11111 RRR c...c Read1 and JumpCondR | |
12663 ............rrr | |
12664 ..........AAAAA | |
12665 @end example | |
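
The operator bit field described above can be unpacked with a few shifts and masks. The sketch below is not taken from @file{mule-ccl.c}; the names @code{ccl_fields}, @code{ccl_unpack}, and @code{ccl_pack} are invented for illustration. It assumes the diagram @samp{XXXXXXXXXXXXXXX RRR TTTTT} is written most significant bit first, so that the operator type occupies the low 5 bits of the word.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of one CCL operator word (hypothetical struct). */
struct ccl_fields
{
  unsigned op;    /* TTTTT: operator type, 5 bits */
  unsigned reg;   /* RRR: register number, 3 bits */
  unsigned arg;   /* X...X: constant, address, or register, 15 bits */
};

/* Split WORD into its fields, assuming TTTTT is in the low bits. */
static struct ccl_fields
ccl_unpack (uint32_t word)
{
  struct ccl_fields f;
  f.op = word & 0x1F;
  f.reg = (word >> 5) & 0x7;
  f.arg = (word >> 8) & 0x7FFF;
  return f;
}

/* Inverse of ccl_unpack(): build a word from its three fields. */
static uint32_t
ccl_pack (unsigned op, unsigned reg, unsigned arg)
{
  assert (op < 32 && reg < 8 && arg < 0x8000);
  return ((uint32_t) arg << 8) | ((uint32_t) reg << 5) | (uint32_t) op;
}
```

For instance, under these assumptions a @samp{Jump} (operator type 00100) to address 10 would be built with @code{ccl_pack (0x04, 0, 10)}.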
12666 | |
12667 @node Modules for Internationalization, , CCL, Multilingual Support | |
12668 @section Modules for Internationalization | |
12669 @cindex modules for internationalization | |
12670 @cindex internationalization, modules for | |
12671 | |
12672 @example | |
12673 @file{mule-canna.c} | |
12674 @file{mule-ccl.c} | |
12675 @file{mule-charset.c} | |
12676 @file{mule-charset.h} | |
12677 @file{file-coding.c} | |
12678 @file{file-coding.h} | |
12679 @file{mule-coding.c} | |
12680 @file{mule-mcpath.c} | |
12681 @file{mule-mcpath.h} | |
12682 @file{mule-wnnfns.c} | |
12683 @file{mule.c} | |
12684 @end example | |
12685 | |
12686 These files implement the MULE (Asian-language) support. Note that MULE | |
12687 actually provides a general interface for all sorts of languages, not | |
12688 just Asian languages (although they are generally the most complicated | |
12689 to support). This code is still in beta. | |
12690 | |
12691 @file{mule-charset.*} and @file{file-coding.*} provide the heart of the | |
12692 XEmacs MULE support. @file{mule-charset.*} implements the @dfn{charset} | |
12693 Lisp object type, which encapsulates a character set (an ordered one- or | |
12694 two-dimensional set of characters, such as US ASCII or JISX0208 Japanese | |
12695 Kanji). | |
12696 | |
12697 @file{file-coding.*} implements the @dfn{coding-system} Lisp object | |
12698 type, which encapsulates a method of converting between different | |
12699 encodings. An encoding is a representation of a stream of characters, | |
12700 possibly from multiple character sets, using a stream of bytes or words, | |
12701 and defines (e.g.) which escape sequences are used to specify particular | |
12702 character sets, how the indices for a character are converted into bytes | |
12703 (sometimes this involves setting the high bit; sometimes complicated | |
12704 rearranging of the values takes place, as in the Shift-JIS encoding), | |
12705 etc. @file{file-coding.*} also contains some generic coding system | |
12706 implementations, such as the binary (no-conversion) coding system and a sample gzip coding system. | |
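
To make the two kinds of index-to-byte conversion mentioned above concrete, the following sketch converts a JIS X 0208 row/column pair (each byte in the range 0x21 to 0x7E) to EUC-JP, which only sets the high bit, and to Shift-JIS, which rearranges the values. The function names are invented for this illustration and are not the ones used in the XEmacs sources; the arithmetic is the standard JIS-to-Shift-JIS mapping.

```c
#include <assert.h>

/* EUC-JP: set the high bit of each JIS X 0208 byte. */
static void
toy_jis_to_euc (unsigned char j1, unsigned char j2,
                unsigned char *e1, unsigned char *e2)
{
  assert (j1 >= 0x21 && j1 <= 0x7E && j2 >= 0x21 && j2 <= 0x7E);
  *e1 = (unsigned char) (j1 | 0x80);
  *e2 = (unsigned char) (j2 | 0x80);
}

/* Shift-JIS: two consecutive JIS rows share one lead byte.  Odd rows use
   trail bytes 0x40..0x9E (skipping 0x7F); even rows use 0x9F..0xFC. */
static void
toy_jis_to_sjis (unsigned char j1, unsigned char j2,
                 unsigned char *s1, unsigned char *s2)
{
  assert (j1 >= 0x21 && j1 <= 0x7E && j2 >= 0x21 && j2 <= 0x7E);
  *s1 = (unsigned char) (((j1 + 1) >> 1) + (j1 < 0x5F ? 0x70 : 0xB0));
  if (j1 & 1)
    *s2 = (unsigned char) (j2 + (j2 < 0x60 ? 0x1F : 0x20));
  else
    *s2 = (unsigned char) (j2 + 0x7E);
}
```

A coding system has to perform conversions of this kind in both directions, in addition to tracking any escape sequences that switch character sets.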
12707 | |
12708 @file{mule-coding.c} contains the implementations of text coding systems. | |
12709 | |
12710 @file{mule-ccl.c} provides the CCL (Code Conversion Language) | |
12711 interpreter. CCL is similar in spirit to Lisp byte code and is used to | |
12712 implement converters for custom encodings. | |
12713 | |
12714 @file{mule-canna.c} and @file{mule-wnnfns.c} provide interfaces to the | |
12715 external programs that implement the Canna and WNN input methods, | |
12716 respectively. This is currently in beta. | |
12717 | |
12718 @file{mule-mcpath.c} provides some functions to allow for pathnames | |
12719 containing extended characters. This code is fragmentary, obsolete, and | |
12720 completely non-working. Instead, @code{pathname-coding-system} is used | |
12721 to specify conversions of names of files and directories. The standard | |
12722 C I/O functions like @samp{open()} are wrapped so that conversion occurs | |
12723 automatically. | |
12724 | |
12725 @file{mule.c} contains a few miscellaneous things. It currently seems | |
12726 to be unused and probably should be removed. | |
12727 | |
12728 | |
12729 | |
12730 @example | |
12731 @file{intl.c} | |
12732 @end example | |
12733 | |
12734 This provides some miscellaneous internationalization code for | |
12735 implementing message translation and interfacing to the Ximp input | |
12736 method. None of this code is currently working. | |
12737 | |
12738 | |
12739 | |
12740 @example | |
12741 @file{iso-wide.h} | |
12742 @end example | |
12743 | |
12744 This contains leftover code from an earlier implementation of | |
12745 Asian-language support, and is not currently used. | |
12746 | |
12747 | |
12748 @node The Lisp Reader and Compiler, Lstreams, Multilingual Support, Top | |
12749 @chapter The Lisp Reader and Compiler | |
12750 @cindex Lisp reader and compiler, the | |
12751 @cindex reader and compiler, the Lisp | |
12752 @cindex compiler, the Lisp reader and | |
12753 | |
12754 Not yet documented. | |
12755 | |
12756 @node Lstreams, Consoles; Devices; Frames; Windows, The Lisp Reader and Compiler, Top | |
12757 @chapter Lstreams | 14846 @chapter Lstreams |
12758 @cindex lstreams | 14847 @cindex lstreams |
12759 | 14848 |
12760 An @dfn{lstream} is an internal Lisp object that provides a generic | 14849 An @dfn{lstream} is an internal Lisp object that provides a generic |
12761 buffering stream implementation. Conceptually, you send data to the | 14850 buffering stream implementation. Conceptually, you send data to the |
12981 @deftypefn {Lstream Method} Lisp_Object marker (Lisp_Object @var{lstream}, void (*@var{markfun}) (Lisp_Object)) | 15070 @deftypefn {Lstream Method} Lisp_Object marker (Lisp_Object @var{lstream}, void (*@var{markfun}) (Lisp_Object)) |
12982 Mark this object for garbage collection. Same semantics as a standard | 15071 Mark this object for garbage collection. Same semantics as a standard |
12983 @code{Lisp_Object} marker. This function can be @code{NULL}. | 15072 @code{Lisp_Object} marker. This function can be @code{NULL}. |
12984 @end deftypefn | 15073 @end deftypefn |
12985 | 15074 |
12986 @node Consoles; Devices; Frames; Windows, The Redisplay Mechanism, Lstreams, Top | 15075 @node Subprocesses, Interface to MS Windows, Lstreams, Top |
12987 @chapter Consoles; Devices; Frames; Windows | |
12988 @cindex consoles; devices; frames; windows | |
12989 @cindex devices; frames; windows, consoles; | |
12990 @cindex frames; windows, consoles; devices; | |
12991 @cindex windows, consoles; devices; frames; | |
12992 | |
12993 @menu | |
12994 * Introduction to Consoles; Devices; Frames; Windows:: | |
12995 * Point:: | |
12996 * Window Hierarchy:: | |
12997 * The Window Object:: | |
12998 * Modules for the Basic Displayable Lisp Objects:: | |
12999 @end menu | |
13000 | |
13001 @node Introduction to Consoles; Devices; Frames; Windows, Point, Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows | |
13002 @section Introduction to Consoles; Devices; Frames; Windows | |
13003 @cindex consoles; devices; frames; windows, introduction to | |
13004 @cindex devices; frames; windows, introduction to consoles; | |
13005 @cindex frames; windows, introduction to consoles; devices; | |
13006 @cindex windows, introduction to consoles; devices; frames; | |
13007 | |
13008 A window-system window that you see on the screen is called a | |
13009 @dfn{frame} in Emacs terminology. Each frame is subdivided into one or | |
13010 more non-overlapping panes, called (confusingly) @dfn{windows}. Each | |
13011 window displays the text of a buffer in it. (See above on Buffers.) Note | |
13012 that buffers and windows are independent entities: Two or more windows | |
13013 can be displaying the same buffer (potentially in different locations), | |
13014 and a buffer can be displayed in no windows. | |
13015 | |
13016 A single display screen that contains one or more frames is called | |
13017 a @dfn{display}. Under most circumstances, there is only one display. | |
13018 However, more than one display can exist, for example if you have | |
13019 a @dfn{multi-headed} console, i.e. one with a single keyboard but | |
13020 multiple displays. (Typically in such a situation, the various | |
13021 displays act like one large display, in that the mouse is only | |
13022 in one of them at a time, and moving the mouse off of one moves | |
13023 it into another.) In some cases, the different displays will | |
13024 have different characteristics, e.g. one color and one mono. | |
13025 | |
13026 XEmacs can display frames on multiple displays. It can even deal | |
13027 simultaneously with frames on multiple keyboards (called @dfn{consoles} in | |
13028 XEmacs terminology). Here is one case where this might be useful: You | |
13029 are using XEmacs on your workstation at work, and leave it running. | |
13030 Then you go home and dial in on a TTY line, and you can use the | |
13031 already-running XEmacs process to display another frame on your local | |
13032 TTY. | |
13033 | |
13034 Thus, there is a hierarchy console -> display -> frame -> window. There is | |
13035 a separate Lisp object type for each of these four concepts; the object that | |
13036 represents a display is called a @dfn{device}. Furthermore, there is | |
13037 logically a @dfn{selected console}, @dfn{selected device}, | |
13038 @dfn{selected frame}, and @dfn{selected window}. Each of these objects is | |
13039 distinguished in various ways, such as being the default object for various | |
13040 functions that act on objects of that type. Note that every containing | |
13041 object remembers the ``selected'' object among the objects that it contains: | |
13042 e.g. not only is there a selected window, but every frame remembers the last | |
13043 window in it that was selected, and changing the selected frame causes the | |
13044 remembered window within it to become the selected window. Similar | |
13045 relationships apply for consoles to devices and devices to frames. | |
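
The ``every container remembers its selected child'' arrangement can be pictured with a few toy structures. This is a sketch only: the @code{toy_} types and functions are hypothetical, each container is reduced to the single field that matters here, and the real objects hold far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Each container remembers which of its children was last selected. */
struct toy_window  { int id; };
struct toy_frame   { struct toy_window *selected_window; };
struct toy_device  { struct toy_frame *selected_frame; };
struct toy_console { struct toy_device *selected_device; };

static struct toy_console *toy_the_selected_console;

static void
toy_select_console (struct toy_console *c)
{
  assert (c != NULL);
  toy_the_selected_console = c;
}

/* Selecting a frame only updates what its device remembers. */
static void
toy_select_frame (struct toy_device *d, struct toy_frame *f)
{
  assert (d != NULL && f != NULL);
  d->selected_frame = f;
}

/* Likewise, a frame records the window last selected within it. */
static void
toy_set_frame_selected_window (struct toy_frame *f, struct toy_window *w)
{
  assert (f != NULL && w != NULL);
  f->selected_window = w;
}

/* The selected window is found by following the remembered selections
   down from the selected console. */
static struct toy_window *
toy_selected_window (void)
{
  struct toy_console *c = toy_the_selected_console;
  if (c == NULL || c->selected_device == NULL
      || c->selected_device->selected_frame == NULL)
    return NULL;
  return c->selected_device->selected_frame->selected_window;
}
```

With this layout, changing the selected frame needs no extra bookkeeping for the window: the new frame's remembered window is what @code{toy_selected_window} finds.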
13046 | |
13047 @node Point, Window Hierarchy, Introduction to Consoles; Devices; Frames; Windows, Consoles; Devices; Frames; Windows | |
13048 @section Point | |
13049 @cindex point | |
13050 | |
13051 Recall that every buffer has a current insertion position, called | |
13052 @dfn{point}. Now, two or more windows may be displaying the same buffer, | |
13053 and the text cursor in the two windows (i.e. @code{point}) can be in | |
13054 two different places. You may ask, how can that be, since each | |
13055 buffer has only one value of @code{point}? The answer is that each window | |
13056 also has a value of @code{point} that is squirreled away in it. There | |
13057 is only one selected window, and the value of @code{point} in its buffer | |
13058 corresponds to that window. When the selected window is changed | |
13059 from one window to another displaying the same buffer, the old | |
13060 value of @code{point} is stored into the old window's ``point'' and the | |
13061 value of @code{point} from the new window is retrieved and made the | |
13062 value of @code{point} in the buffer. This means that @code{window-point} | |
13063 for the selected window is potentially inaccurate, and if you | |
13064 want to retrieve the correct value of @code{point} for a window, | |
13065 you must special-case on the selected window and retrieve the | |
13066 buffer's point instead. This is related to why @code{save-window-excursion} | |
13067 does not save the selected window's value of @code{point}. | |
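
The hand-off between the buffer's point and the per-window stashed point can be sketched as follows. The structures and function names are hypothetical and positions are plain integers rather than markers; the sketch only mirrors the description above and is not the code in @file{window.c}.

```c
#include <assert.h>
#include <stddef.h>

struct toy_buffer { long point; };            /* the buffer's own point */
struct toy_window
{
  struct toy_buffer *buffer;
  long pointm;                                /* stashed point */
};

static struct toy_window *toy_selected;

/* Switch the selected window: stash the buffer's point in the old
   window, then load the new window's stashed point into its buffer. */
static void
toy_select_window (struct toy_window *w)
{
  assert (w != NULL && w->buffer != NULL);
  if (toy_selected != NULL)
    toy_selected->pointm = toy_selected->buffer->point;
  toy_selected = w;
  w->buffer->point = w->pointm;
}

/* The accurate point for W: the stashed value may be stale for the
   selected window, so that case reads the buffer instead. */
static long
toy_window_point (const struct toy_window *w)
{
  assert (w != NULL);
  return w == toy_selected ? w->buffer->point : w->pointm;
}
```

Note how the selected window's @code{pointm} goes stale as soon as point moves in the buffer; it is only brought up to date when another window is selected.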
13068 | |
13069 @node Window Hierarchy, The Window Object, Point, Consoles; Devices; Frames; Windows | |
13070 @section Window Hierarchy | |
13071 @cindex window hierarchy | |
13072 @cindex hierarchy of windows | |
13073 | |
13074 If a frame contains multiple windows (panes), they are always created | |
13075 by splitting an existing window along the horizontal or vertical axis. | |
13076 Terminology is a bit confusing here: to @dfn{split a window | |
13077 horizontally} means to create two side-by-side windows, i.e. to make a | |
13078 @emph{vertical} cut in a window. Likewise, to @dfn{split a window | |
13079 vertically} means to create two windows, one above the other, by making | |
13080 a @emph{horizontal} cut. | |
13081 | |
13082 If you split a window and then split again along the same axis, you | |
13083 will end up with a number of panes all arranged along the same axis. | |
13084 The precise way in which the splits were made should not be important, | |
13085 and this is reflected internally. Internally, all windows are arranged | |
13086 in a tree, consisting of two types of windows, @dfn{combination} windows | |
13087 (which have children, and are covered completely by those children) and | |
13088 @dfn{leaf} windows, which have no children and are visible. Every | |
13089 combination window has two or more children, all arranged along the same | |
13090 axis. There are (logically) two subtypes of combination windows, depending | |
13091 on whether their children are horizontally or vertically arrayed. There is | |
13092 always one root window, which is either a leaf window (if the frame | |
13093 contains only one window) or a combination window (if the frame contains | |
13094 more than one window). In the latter case, the root window will have | |
13095 two or more children, either horizontally or vertically arrayed, and | |
13096 each of those children will be either a leaf window or another | |
13097 combination window. | |
13098 | |
13099 Here are some rules: | |
13100 | |
13101 @enumerate | |
13102 @item | |
13103 Horizontal combination windows can never have children that are | |
13104 horizontal combination windows; same for vertical. | |
13105 | |
13106 @item | |
13107 Only leaf windows can be split (obviously) and this splitting does one | |
13108 of two things: (a) turns the leaf window into a combination window and | |
13109 creates two new leaf children, or (b) turns the leaf window into one of | |
13110 the two new leaves and creates the other leaf. Rule (1) dictates which | |
13111 of these two outcomes happens. | |
13112 | |
13113 @item | |
13114 Every combination window must have at least two children. | |
13115 | |
13116 @item | |
13117 Leaf windows can never become combination windows. They can be deleted, | |
13118 however. If this results in a violation of (3), the parent combination | |
13119 window also gets deleted. | |
13120 | |
13121 @item | |
13122 All functions that accept windows must be prepared to accept combination | |
13123 windows, and do something sane (e.g. signal an error if so). | |
13124 Combination windows @emph{do} escape to the Lisp level. | |
13125 | |
13126 @item | |
13127 All windows have three fields governing their contents: | |
13128 these are @dfn{hchild} (a list of horizontally-arrayed children), | |
13129 @dfn{vchild} (a list of vertically-arrayed children), and @dfn{buffer} | |
13130 (the buffer contained in a leaf window). Exactly one of | |
13131 these will be non-@code{nil}. Remember that @dfn{horizontally-arrayed} | |
13132 means ``side-by-side'' and @dfn{vertically-arrayed} means | |
13133 ``one above the other''. | |
13134 | |
13135 @item | |
13136 Leaf windows also have markers in their @code{start} (the | |
13137 first buffer position displayed in the window) and @code{pointm} | |
13138 (the window's stashed value of @code{point}---see above) fields, | |
13139 while combination windows have @code{nil} in these fields. | |
13140 | |
13141 @item | |
13142 The list of children for a window is threaded through the | |
13143 @code{next} and @code{prev} fields of each child window. | |
13144 | |
13145 @item | |
13146 @strong{Deleted windows can be undeleted}. This happens as a result of | |
13147 restoring a window configuration, and is unlike frames, displays, and | |
13148 consoles, which, once deleted, can never be restored. Deleting a window | |
13149 does nothing except set a special @code{dead} bit to 1 and clear out the | |
13150 @code{next}, @code{prev}, @code{hchild}, and @code{vchild} fields, for | |
13151 GC purposes. | |
13152 | |
13153 @item | |
13154 Most frames actually have two top-level windows---one for the | |
13155 minibuffer and one (the @dfn{root}) for everything else. The modeline | |
13156 (if present) separates these two. The @code{next} field of the root | |
13157 points to the minibuffer, and the @code{prev} field of the minibuffer | |
13158 points to the root. The other @code{next} and @code{prev} fields are | |
13159 @code{nil}, and the frame points to both of these windows. | |
13160 Minibuffer-less frames have no minibuffer window, and the @code{next} | |
13161 and @code{prev} of the root window are @code{nil}. Minibuffer-only | |
13162 frames have no root window, and the @code{next} of the minibuffer window | |
13163 is @code{nil} but the @code{prev} points to itself. (#### This is an | |
13164 artifact that should be fixed.) | |
13165 @end enumerate | |
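
Several of the rules above are structural invariants that can be checked mechanically. The following sketch checks rules (1), (3), (6), and (8) on a toy tree. The @code{toy_window} structure is a hypothetical reduction of the real window object to the fields named in the rules, with a string standing in for the buffer and C pointers standing in for Lisp objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_window
{
  struct toy_window *hchild;  /* first of the side-by-side children */
  struct toy_window *vchild;  /* first of the stacked children */
  const char *buffer;         /* stand-in for a leaf's buffer */
  struct toy_window *next;    /* next sibling, or NULL */
  struct toy_window *prev;    /* previous sibling, or NULL */
};

/* Return true if the subtree rooted at W obeys the checked rules. */
static bool
toy_window_tree_ok (const struct toy_window *w)
{
  const struct toy_window *c;
  int filled, nchildren = 0;

  assert (w != NULL);
  /* Rule 6: exactly one of hchild, vchild, buffer is set. */
  filled = (w->hchild != NULL) + (w->vchild != NULL) + (w->buffer != NULL);
  if (filled != 1)
    return false;
  if (w->buffer != NULL)      /* a leaf has nothing more to check */
    return true;

  for (c = w->hchild ? w->hchild : w->vchild; c != NULL; c = c->next)
    {
      nchildren++;
      /* Rule 1: no child has the same orientation as its parent. */
      if (w->hchild != NULL && c->hchild != NULL)
        return false;
      if (w->vchild != NULL && c->vchild != NULL)
        return false;
      /* Rule 8: siblings are threaded consistently. */
      if (c->next != NULL && c->next->prev != c)
        return false;
      if (!toy_window_tree_ok (c))
        return false;
    }
  return nchildren >= 2;      /* Rule 3 */
}
```

A checker along these lines could be run after every split or delete in a debugging build, although whether that is worthwhile depends on how expensive the traversal is for the frame layouts in use.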
13166 | |
13167 @node The Window Object, Modules for the Basic Displayable Lisp Objects, Window Hierarchy, Consoles; Devices; Frames; Windows | |
13168 @section The Window Object | |
13169 @cindex window object, the | |
13170 @cindex object, the window | |
13171 | |
13172 Windows have the following accessible fields: | |
13173 | |
13174 @table @code | |
13175 @item frame | |
13176 The frame that this window is on. | |
13177 | |
13178 @item mini_p | |
13179 Non-@code{nil} if this window is a minibuffer window. | |
13180 | |
13181 @item buffer | |
13182 The buffer that the window is displaying. This may change often during | |
13183 the life of the window. | |
13184 | |
13185 @item dedicated | |
13186 Non-@code{nil} if this window is dedicated to its buffer. | |
13187 | |
13188 @item pointm | |
13189 @cindex window point internals | |
13190 This is the value of point in the current buffer when this window is | |
13191 selected; when it is not selected, it retains its previous value. | |
13192 | |
13193 @item start | |
13194 The position in the buffer that is the first character to be displayed | |
13195 in the window. | |
13196 | |
13197 @item force_start | |
13198 If this flag is non-@code{nil}, it says that the window has been | |
13199 scrolled explicitly by the Lisp program. This affects what the next | |
13200 redisplay does if point is off the screen: instead of scrolling the | |
13201 window to show the text around point, it moves point to a location that | |
13202 is on the screen. | |
13203 | |
13204 @item last_modified | |
13205 The @code{modified} field of the window's buffer, as of the last time | |
13206 a redisplay completed in this window. | |
13207 | |
13208 @item last_point | |
13209 The buffer's value of point, as of the last time | |
13210 a redisplay completed in this window. | |
13211 | |
13212 @item left | |
13213 This is the left-hand edge of the window, measured in columns. (The | |
13214 leftmost column on the screen is @w{column 0}.) | |
13215 | |
13216 @item top | |
13217 This is the top edge of the window, measured in lines. (The top line on | |
13218 the screen is @w{line 0}.) | |
13219 | |
13220 @item height | |
13221 The height of the window, measured in lines. | |
13222 | |
13223 @item width | |
13224 The width of the window, measured in columns. | |
13225 | |
13226 @item next | |
13227 This is the window that is the next in the chain of siblings. It is | |
13228 @code{nil} in a window that is the rightmost or bottommost of a group of | |
13229 siblings. | |
13230 | |
13231 @item prev | |
13232 This is the window that is the previous in the chain of siblings. It is | |
13233 @code{nil} in a window that is the leftmost or topmost of a group of | |
13234 siblings. | |
13235 | |
13236 @item parent | |
13237 Internally, XEmacs arranges windows in a tree; each group of siblings has | |
13238 a parent window whose area includes all the siblings. This field points | |
13239 to a window's parent. | |
13240 | |
13241 Parent windows do not display buffers, and play little role in display | |
13242 except to shape their child windows. Emacs Lisp programs usually have | |
13243 no access to the parent windows; they operate on the windows at the | |
13244 leaves of the tree, which actually display buffers. | |
13245 | |
13246 @item hscroll | |
13247 This is the number of columns that the display in the window is scrolled | |
13248 horizontally to the left. Normally, this is 0. | |
13249 | |
13250 @item use_time | |
13251 This is the last time that the window was selected. The function | |
13252 @code{get-lru-window} uses this field. | |
13253 | |
13254 @item display_table | |
13255 The window's display table, or @code{nil} if none is specified for it. | |
13256 | |
13257 @item update_mode_line | |
13258 Non-@code{nil} means this window's mode line needs to be updated. | |
13259 | |
13260 @item base_line_number | |
13261 The line number of a certain position in the buffer, or @code{nil}. | |
13262 This is used for displaying the line number of point in the mode line. | |
13263 | |
13264 @item base_line_pos | |
13265 The position in the buffer for which the line number is known, or | |
13266 @code{nil} meaning none is known. | |
13267 | |
13268 @item region_showing | |
13269 If the region (or part of it) is highlighted in this window, this field | |
13270 holds the mark position that made one end of that region. Otherwise, | |
13271 this field is @code{nil}. | |
13272 @end table | |
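
As an illustration of how a field such as @code{use_time} can be used, the sketch below picks the least recently selected window from a flat list. It assumes that @code{use_time} is a stamp that increases every time a window is selected; the structure, the counter, and the function names are hypothetical and do not reproduce the real @code{get-lru-window}.

```c
#include <assert.h>
#include <stddef.h>

struct toy_window
{
  long use_time;             /* stamp taken when last selected */
  struct toy_window *next;   /* next window in a flat list of leaves */
};

static long toy_select_clock;

/* Record that W has just been selected. */
static void
toy_note_selected (struct toy_window *w)
{
  assert (w != NULL);
  w->use_time = ++toy_select_clock;
}

/* Return the window in LIST with the oldest stamp, or NULL if LIST is
   empty. */
static struct toy_window *
toy_get_lru_window (struct toy_window *list)
{
  struct toy_window *best = list;
  struct toy_window *w;

  for (w = list; w != NULL; w = w->next)
    if (w->use_time < best->use_time)
      best = w;
  return best;
}
```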
13273 | |
13274 @node Modules for the Basic Displayable Lisp Objects, , The Window Object, Consoles; Devices; Frames; Windows | |
13275 @section Modules for the Basic Displayable Lisp Objects | |
13276 @cindex modules for the basic displayable Lisp objects | |
13277 @cindex displayable Lisp objects, modules for the basic | |
13278 @cindex Lisp objects, modules for the basic displayable | |
13279 @cindex objects, modules for the basic displayable Lisp | |
13280 | |
13281 @example | |
13282 @file{console-msw.c} | |
13283 @file{console-msw.h} | |
13284 @file{console-stream.c} | |
13285 @file{console-stream.h} | |
13286 @file{console-tty.c} | |
13287 @file{console-tty.h} | |
13288 @file{console-x.c} | |
13289 @file{console-x.h} | |
13290 @file{console.c} | |
13291 @file{console.h} | |
13292 @end example | |
13293 | |
13294 These modules implement the @dfn{console} Lisp object type. A console | |
13295 can contain multiple display devices, but only one keyboard and mouse. | |
13296 Most of the time, a console will contain exactly one device. | |
13297 | |
13298 Consoles are the top of a Lisp object inclusion hierarchy. Consoles | |
13299 contain devices, which contain frames, which contain windows. | |
13300 | |
13301 | |
13302 | |
13303 @example | |
13304 @file{device-msw.c} | |
13305 @file{device-tty.c} | |
13306 @file{device-x.c} | |
13307 @file{device.c} | |
13308 @file{device.h} | |
13309 @end example | |
13310 | |
13311 These modules implement the @dfn{device} Lisp object type. This | |
13312 abstracts a particular screen or connection on which frames are | |
13313 displayed. As with Lisp objects, event interfaces, and other | |
13314 subsystems, the device code is separated into a generic component that | |
13315 provides a standardized interface (in the form of a set of methods) and | |
13316 components that implement those methods for particular device types. | |
13317 | |
13318 The device subsystem defines all the methods and provides method | |
13319 services not only for device operations but also for the frame, window, | |
13320 menubar, scrollbar, toolbar, and other displayable-object subsystems. | |
13321 The reason for this is that all of these subsystems have the same | |
13322 subtypes (X, TTY, NeXTstep, Microsoft Windows, etc.) as devices do. | |
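
The split between generic code and per-type methods can be sketched with a table of function pointers. Everything here is hypothetical: the @code{toy_} names, the two methods, and the numbers they return are made up for illustration, and the real method tables are larger and are set up differently.

```c
#include <assert.h>
#include <stddef.h>

struct toy_device;

/* One table of methods per device type; optional methods may be NULL. */
struct toy_device_methods
{
  const char *name;
  void (*init_device) (struct toy_device *d);
  int (*device_pixel_width) (const struct toy_device *d);
};

struct toy_device
{
  const struct toy_device_methods *methods;
  int columns;
};

/* Type-specific implementations. */
static void toy_tty_init (struct toy_device *d) { d->columns = 80; }
static void toy_x_init (struct toy_device *d) { d->columns = 100; }
static int
toy_x_pixel_width (const struct toy_device *d)
{
  return d->columns * 8;
}

static const struct toy_device_methods toy_tty_methods =
  { "tty", toy_tty_init, NULL };
static const struct toy_device_methods toy_x_methods =
  { "x", toy_x_init, toy_x_pixel_width };

/* Generic code: attach the method table and run the type's initializer. */
static void
toy_make_device (struct toy_device *d, const struct toy_device_methods *m)
{
  assert (d != NULL && m != NULL && m->init_device != NULL);
  d->methods = m;
  m->init_device (d);
}

/* Generic code: call an optional method, with a default when the device
   type does not provide it. */
static int
toy_device_pixel_width (const struct toy_device *d)
{
  if (d->methods->device_pixel_width != NULL)
    return d->methods->device_pixel_width (d);
  return -1;   /* this device type has no notion of pixels */
}
```

Because frames, windows, and the other displayable objects come in the same subtypes as devices, one such table per device type can carry their methods as well.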
13323 | |
13324 | |
13325 | |
13326 @example | |
13327 @file{frame-msw.c} | |
13328 @file{frame-tty.c} | |
13329 @file{frame-x.c} | |
13330 @file{frame.c} | |
13331 @file{frame.h} | |
13332 @end example | |
13333 | |
13334 Each device contains one or more frames in which objects (e.g. text) are | |
13335 displayed. A frame corresponds to a window in the window system; | |
13336 usually this is a top-level window but it could potentially be one of a | |
13337 number of overlapping child windows within a top-level window, using the | |
13338 MDI (Multiple Document Interface) protocol in Microsoft Windows or a | |
13339 similar scheme. | |
13340 | |
13341 The @file{frame-*} files implement the @dfn{frame} Lisp object type and | |
13342 provide the generic and device-type-specific operations on frames | |
13343 (e.g. raising, lowering, resizing, moving, etc.). | |
13344 | |
13345 | |
13346 | |
13347 @example | |
13348 @file{window.c} | |
13349 @file{window.h} | |
13350 @end example | |
13351 | |
13352 @cindex window (in Emacs) | |
13353 @cindex pane | |
13354 Each frame consists of one or more non-overlapping @dfn{windows} (better | |
13355 known as @dfn{panes} in standard window-system terminology) in which a | |
13356 buffer's text can be displayed. Windows can also have scrollbars | |
13357 displayed around their edges. | |
13358 | |
13359 @file{window.c} and @file{window.h} implement the @dfn{window} Lisp | |
13360 object type and provide code to manage windows. Since windows have no | |
13361 associated resources in the window system (the window system knows only | |
13362 about the frame; no child windows or anything are used for XEmacs | |
13363 windows), there is no device-type-specific code here; all of that code | |
13364 is part of the redisplay mechanism or the code for particular object | |
13365 types such as scrollbars. | |
13366 | |
13367 @node The Redisplay Mechanism, Extents, Consoles; Devices; Frames; Windows, Top | |
13368 @chapter The Redisplay Mechanism | |
13369 @cindex redisplay mechanism, the | |
13370 | |
13371 The redisplay mechanism is one of the most complicated sections of | |
13372 XEmacs, especially from a conceptual standpoint. This is doubly so | |
13373 because, unlike for the basic aspects of the Lisp interpreter, the | |
13374 computer science theories of how to efficiently handle redisplay are not | |
13375 well-developed. | |
13376 | |
13377 When working with the redisplay mechanism, remember the Golden Rules | |
13378 of Redisplay: | |
13379 | |
13380 @enumerate | |
13381 @item | |
13382 It Is Better To Be Correct Than Fast. | |
13383 @item | |
13384 Thou Shalt Not Run Elisp From Within Redisplay. | |
13385 @item | |
13386 It Is Better To Be Fast Than Not To Be. | |
13387 @end enumerate | |
13388 | |
13389 @menu | |
13390 * Critical Redisplay Sections:: | |
13391 * Line Start Cache:: | |
13392 * Redisplay Piece by Piece:: | |
13393 * Modules for the Redisplay Mechanism:: | |
13394 * Modules for other Display-Related Lisp Objects:: | |
13395 @end menu | |
13396 | |
13397 @node Critical Redisplay Sections, Line Start Cache, The Redisplay Mechanism, The Redisplay Mechanism | |
13398 @section Critical Redisplay Sections | |
13399 @cindex redisplay sections, critical | |
13400 @cindex critical redisplay sections | |
13401 | |
13402 Within this section, we are defenseless and assume that the | |
13403 following cannot happen: | |
13404 | |
13405 @enumerate | |
13406 @item | |
13407 garbage collection | |
13408 @item | |
13409 Lisp code evaluation | |
13410 @item | |
13411 frame size changes | |
13412 @end enumerate | |
13413 | |
13414 We ensure (3) by calling @code{hold_frame_size_changes()}, which | |
13415 will cause any pending frame size changes to get put on hold | |
13416 till after the end of the critical section. (1) follows | |
13417 automatically if (2) is met. #### Unfortunately, there are | |
13418 some places where Lisp code can be called within this section. | |
13419 We need to remove them. | |
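
The deferral described above can be pictured with a small sketch. This is
not the XEmacs implementation of @code{hold_frame_size_changes()}; the
structure, its fields and the @code{_sketch} functions are hypothetical
names chosen for illustration, and the sketch assumes that only the most
recent held request matters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical frame state; the field names are illustrative only. */
struct frame_sketch {
    int  width, height;           /* size redisplay currently works with */
    int  hold_depth;              /* > 0 while inside a critical section */
    bool size_change_pending;     /* a request arrived while held */
    int  pending_width, pending_height;
};

/* Enter a critical section: later size requests are only recorded. */
void hold_size_changes_sketch(struct frame_sketch *f)
{
    f->hold_depth++;
}

/* A resize request, for example one coming from the window system. */
void request_frame_size_sketch(struct frame_sketch *f, int width, int height)
{
    if (f->hold_depth > 0) {
        /* Inside a critical section: remember the request, change nothing. */
        f->size_change_pending = true;
        f->pending_width = width;
        f->pending_height = height;
        return;
    }
    f->width = width;
    f->height = height;
}

/* Leave the critical section and apply the most recent held request. */
void unhold_size_changes_sketch(struct frame_sketch *f)
{
    assert(f->hold_depth > 0);
    if (--f->hold_depth == 0 && f->size_change_pending) {
        f->size_change_pending = false;
        f->width = f->pending_width;
        f->height = f->pending_height;
    }
}
```

The point of the counter is that the frame dimensions seen by redisplay
stay constant for the whole critical section, even if several requests
arrive in the meantime.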
13420 | |
13421 If @code{Fsignal()} is called during this critical section, we | |
13422 will @code{abort()}. | |
13423 | |
13424 If garbage collection is called during this critical section, | |
13425 we simply return. #### We should abort instead. | |
13426 | |
13427 #### If a frame-size change does occur we should probably | |
13428 actually be preempting redisplay. | |
13429 | |
13430 @node Line Start Cache, Redisplay Piece by Piece, Critical Redisplay Sections, The Redisplay Mechanism | |
13431 @section Line Start Cache | |
13432 @cindex line start cache | |
13433 | |
13434 The traditional scrolling code in Emacs breaks in a variable height | |
13435 world. It depends on the key assumption that the number of lines that | |
13436 can be displayed at any given time is fixed. This led to a complete | |
13437 separation of the scrolling code from the redisplay code. In order to | |
13438 fully support variable height lines, the scrolling code must actually be | |
13439 tightly integrated with redisplay. Only redisplay can determine how | |
13440 many lines will be displayed on a screen for any given starting point. | |
13441 | |
13442 What is ideally wanted is a complete list of the starting buffer | |
13443 position for every possible display line of a buffer along with the | |
13444 height of that display line. Maintaining such a full list would be very | |
13445 expensive. We settle for having it include information for all areas | |
13446 which we happen to generate anyhow (i.e. the region currently being | |
13447 displayed) and for those areas we need to work with. | |
13448 | |
13449 In order to ensure that the cache accurately represents what redisplay | |
13450 would actually show, it is necessary to invalidate it in many | |
13451 situations. If the buffer changes, the starting positions may no longer | |
13452 be correct. If a face or an extent has changed then the line heights | |
13453 may have altered. These events happen frequently enough that the cache | |
13454 can end up being constantly disabled. With this potentially constant | |
13455 invalidation, when is the cache ever useful? | |
13456 | |
13457 Even if the cache is invalidated before every single usage, it is | |
13458 necessary. Scrolling often requires knowledge about display lines which | |
13459 are actually above or below the visible region. The cache provides a | |
13460 convenient light-weight method of storing this information for multiple | |
13461 display regions. This knowledge is necessary for the scrolling code to | |
13462 always obey the First Golden Rule of Redisplay. | |
13463 | |
13464 If the cache already contains all of the information that the scrolling | |
13465 routines happen to need so that it doesn't have to go generate it, then | |
13466 we are able to obey the Third Golden Rule of Redisplay. The first thing | |
13467 we do to help out the cache is to always add the displayed region. This | |
13468 region had to be generated anyway, so the cache ends up getting the | |
13469 information basically for free. In those cases where a user is simply | |
13470 scrolling around viewing a buffer there is a high probability that this | |
13471 is sufficient to always provide the needed information. The second | |
13472 thing we can do is be smart about invalidating the cache. | |
13473 | |
13474 TODO---Be smart about invalidating the cache. Potential places: | |
13475 | |
13476 @itemize @bullet | |
13477 @item | |
13478 Insertions at end-of-line which don't cause line-wraps do not alter the | |
13479 starting positions of any display lines. These types of buffer | |
13480 modifications should not invalidate the cache. This is actually a large | |
13481 optimization for redisplay speed as well. | |
13482 @item | |
13483 Buffer modifications frequently only affect the display of lines at and | |
13484 below where they occur. In these situations we should only invalidate | |
13485 the part of the cache starting at where the modification occurs. | |
13486 @end itemize | |
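
The second idea in the list above, invalidating only from the point of
modification onward, might look roughly like the following. This is a
sketch under stated assumptions, not the existing cache code: the entry
layout (start, end, height), the fixed capacity and the function name
are all hypothetical, and entries are assumed to be sorted by buffer
position.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_MAX_LINES 64   /* fixed capacity keeps the sketch short */

/* Hypothetical cache entry: one display line. */
struct line_start_entry {
    long start;    /* buffer position where the display line begins */
    long end;      /* buffer position where the display line ends */
    int  height;   /* pixel height of the display line */
};

struct line_start_cache_sketch {
    struct line_start_entry lines[CACHE_MAX_LINES];
    size_t count;  /* number of valid entries, sorted by start */
};

/* Drop every cached line that ends at or after POS, keeping the lines
   that lie entirely above the modification.  Returns the number of
   entries dropped. */
size_t invalidate_from_sketch(struct line_start_cache_sketch *c, long pos)
{
    size_t keep = 0;
    size_t dropped;

    assert(c->count <= CACHE_MAX_LINES);
    while (keep < c->count && c->lines[keep].end < pos)
        keep++;
    dropped = c->count - keep;
    c->count = keep;
    return dropped;
}
```

A modification below the cached region then costs nothing, and one in
the middle preserves the information for all lines above it.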
13487 | |
13488 In case you're wondering, the Second Golden Rule of Redisplay is not | |
13489 applicable. | |
13490 | |
13491 @node Redisplay Piece by Piece, Modules for the Redisplay Mechanism, Line Start Cache, The Redisplay Mechanism | |
13492 @section Redisplay Piece by Piece | |
13493 @cindex redisplay piece by piece | |
13494 | |
13495 As you can begin to see, redisplay is complex and also not well | |
13496 documented. Chuck no longer works on XEmacs, so this section is my take | |
13497 on the workings of redisplay. | |
13498 | |
13499 Redisplay happens in three phases: | |
13500 | |
13501 @enumerate | |
13502 @item | |
13503 Determine the desired display in the area that needs redisplay. | |
13504 Implemented by @file{redisplay.c}. | |
13505 @item | |
13506 Compare the desired display with the current display. | |
13507 Implemented by @file{redisplay-output.c}. | |
13508 @item | |
13509 Output the changes. Implemented by @file{redisplay-output.c}, | |
13510 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c}. | |
13511 @end enumerate | |
13512 | |
13513 Steps 1 and 2 are device-independent and relatively complex. Step 3 is | |
13514 mostly device-dependent. | |
13515 | |
13516 Determining the desired display | |
13517 | |
13518 Display attributes are stored in @code{display_line} structures. Each | |
13519 @code{display_line} consists of a set of @code{display_block}s, and each | |
13520 @code{display_block} contains a number of @code{rune}s. Generally, | |
13521 dynarrs of @code{display_line}s are held by each window, representing | |
13522 the current display and the desired display. | |
13523 | |
13524 The @code{display_line} structures are tightly tied to buffers, which | |
13525 presents a problem for redisplay as this connection is bogus for the | |
13526 modeline. Hence the @code{display_line} generation routines are | |
13527 duplicated for generating the modeline. This means that the modeline | |
13528 display code has many bugs that the standard redisplay code does not. | |
13529 | |
13530 The guts of @code{display_line} generation are in | |
13531 @code{create_text_block}, which creates a single display line for the | |
13532 desired locale. This incrementally parses the characters on the current | |
13533 line and generates redisplay structures for each. | |
13534 | |
13535 Gutter redisplay is different. Because the data to display is stored in | |
13536 a string, we cannot use @code{create_text_block}. Instead we use | |
13537 @code{create_text_string_block} which performs the same function as | |
13538 @code{create_text_block} but for strings. Many of the complexities of | |
13539 @code{create_text_block} to do with cursor handling and selective | |
13540 display have been removed. | |
13541 | |
13542 @node Modules for the Redisplay Mechanism, Modules for other Display-Related Lisp Objects, Redisplay Piece by Piece, The Redisplay Mechanism | |
13543 @section Modules for the Redisplay Mechanism | |
13544 @cindex modules for the redisplay mechanism | |
13545 @cindex redisplay mechanism, modules for the | |
13546 | |
13547 @example | |
13548 @file{redisplay-output.c} | |
13549 @file{redisplay-msw.c} | |
13550 @file{redisplay-tty.c} | |
13551 @file{redisplay-x.c} | |
13552 @file{redisplay.c} | |
13553 @file{redisplay.h} | |
13554 @end example | |
13555 | |
13556 These files provide the redisplay mechanism. As with many other | |
13557 subsystems in XEmacs, there is a clean separation between the general | |
13558 and device-specific support. | |
13559 | |
13560 @file{redisplay.c} contains the bulk of the redisplay engine. These | |
13561 functions update the redisplay structures (which describe how the screen | |
13562 is to appear) to reflect any changes made to the state of any | |
13563 displayable objects (buffer, frame, window, etc.) since the last time | |
13564 that redisplay was called. These functions are highly optimized to | |
13565 avoid doing more work than necessary (since redisplay is called | |
13566 extremely often and is potentially a huge time sink), and depend heavily | |
13567 on notifications from the objects themselves that changes have occurred, | |
13568 so that redisplay doesn't explicitly have to check each possible object. | |
13569 The redisplay mechanism also contains a great deal of caching to further | |
13570 speed things up; some of this caching is contained within the various | |
13571 displayable objects. | |
13572 | |
13573 @file{redisplay-output.c} goes through the redisplay structures and converts | |
13574 them into calls to device-specific methods to actually output the screen | |
13575 changes. | |
13576 | |
13577 @file{redisplay-x.c}, @file{redisplay-msw.c} and @file{redisplay-tty.c} | |
13578 are implementations of these redisplay output methods, for X frames, | |
13579 MS Windows frames and TTY frames, respectively. | |
13580 | |
13581 | |
13582 | |
13583 @example | |
13584 @file{indent.c} | |
13585 @end example | |
13586 | |
13587 This module contains various functions and Lisp primitives for | |
13588 converting between buffer positions and screen positions. These | |
13589 functions call the redisplay mechanism to do most of the work, and then | |
13590 examine the redisplay structures to get the necessary information. This | |
13591 module needs work. | |
13592 | |
13593 | |
13594 | |
13595 @example | |
13596 @file{termcap.c} | |
13597 @file{terminfo.c} | |
13598 @file{tparam.c} | |
13599 @end example | |
13600 | |
13601 These files contain functions for working with the termcap (BSD-style) | |
13602 and terminfo (System V style) databases of terminal capabilities and | |
13603 escape sequences, used when XEmacs is displaying in a TTY. | |
13604 | |
13605 | |
13606 | |
13607 @example | |
13608 @file{cm.c} | |
13609 @file{cm.h} | |
13610 @end example | |
13611 | |
13612 These files provide some miscellaneous TTY-output functions and should | |
13613 probably be merged into @file{redisplay-tty.c}. | |
13614 | |
13615 | |
13616 | |
13617 @node Modules for other Display-Related Lisp Objects, , Modules for the Redisplay Mechanism, The Redisplay Mechanism | |
13618 @section Modules for other Display-Related Lisp Objects | |
13619 @cindex modules for other display-related Lisp objects | |
13620 @cindex display-related Lisp objects, modules for other | |
13621 @cindex Lisp objects, modules for other display-related | |
13622 | |
13623 @example | |
13624 @file{faces.c} | |
13625 @file{faces.h} | |
13626 @end example | |
13627 | |
13628 | |
13629 | |
13630 @example | |
13631 @file{bitmaps.h} | |
13632 @file{glyphs-eimage.c} | |
13633 @file{glyphs-msw.c} | |
13634 @file{glyphs-msw.h} | |
13635 @file{glyphs-widget.c} | |
13636 @file{glyphs-x.c} | |
13637 @file{glyphs-x.h} | |
13638 @file{glyphs.c} | |
13639 @file{glyphs.h} | |
13640 @end example | |
13641 | |
13642 | |
13643 | |
13644 @example | |
13645 @file{objects-msw.c} | |
13646 @file{objects-msw.h} | |
13647 @file{objects-tty.c} | |
13648 @file{objects-tty.h} | |
13649 @file{objects-x.c} | |
13650 @file{objects-x.h} | |
13651 @file{objects.c} | |
13652 @file{objects.h} | |
13653 @end example | |
13654 | |
13655 | |
13656 | |
13657 @example | |
13658 @file{menubar-msw.c} | |
13659 @file{menubar-msw.h} | |
13660 @file{menubar-x.c} | |
13661 @file{menubar.c} | |
13662 @file{menubar.h} | |
13663 @end example | |
13664 | |
13665 | |
13666 | |
13667 @example | |
13668 @file{scrollbar-msw.c} | |
13669 @file{scrollbar-msw.h} | |
13670 @file{scrollbar-x.c} | |
13671 @file{scrollbar-x.h} | |
13672 @file{scrollbar.c} | |
13673 @file{scrollbar.h} | |
13674 @end example | |
13675 | |
13676 | |
13677 | |
13678 @example | |
13679 @file{toolbar-msw.c} | |
13680 @file{toolbar-x.c} | |
13681 @file{toolbar.c} | |
13682 @file{toolbar.h} | |
13683 @end example | |
13684 | |
13685 | |
13686 | |
13687 @example | |
13688 @file{font-lock.c} | |
13689 @end example | |
13690 | |
13691 This file provides C support for syntax highlighting---i.e. | |
13692 highlighting different syntactic constructs of a source file in | |
13693 different colors, for easy reading. The C support is provided so that | |
13694 this is fast. | |
13695 | |
13696 | |
13697 | |
13698 @example | |
13699 @file{dgif_lib.c} | |
13700 @file{gif_err.c} | |
13701 @file{gif_lib.h} | |
13702 @file{gifalloc.c} | |
13703 @end example | |
13704 | |
13705 These modules decode GIF-format image files, for use with glyphs. | |
13706 These files were removed due to Unisys patent infringement concerns. | |
13707 | |
13708 | |
13709 @node Extents, Faces, The Redisplay Mechanism, Top | |
13710 @chapter Extents | |
13711 @cindex extents | |
13712 | |
13713 @menu | |
13714 * Introduction to Extents:: Extents are ranges over text, with properties. | |
13715 * Extent Ordering:: How extents are ordered internally. | |
13716 * Format of the Extent Info:: The extent information in a buffer or string. | |
13717 * Zero-Length Extents:: A weird special case. | |
13718 * Mathematics of Extent Ordering:: A rigorous foundation. | |
13719 * Extent Fragments:: Cached information useful for redisplay. | |
13720 @end menu | |
13721 | |
13722 @node Introduction to Extents, Extent Ordering, Extents, Extents | |
13723 @section Introduction to Extents | |
13724 @cindex extents, introduction to | |
13725 | |
13726 Extents are regions over a buffer, with a start and an end position | |
13727 denoting the region of the buffer included in the extent. In | |
13728 addition, either end can be closed or open, meaning that the endpoint | |
13729 is or is not logically included in the extent. Insertion of a character | |
13730 at a closed endpoint causes the character to go inside the extent; | |
13731 insertion at an open endpoint causes the character to go outside. | |
13732 | |
13733 Extent endpoints are stored using memory indices (see @file{insdel.c}), | |
13734 to minimize the amount of adjusting that needs to be done when | |
13735 characters are inserted or deleted. | |
13736 | |
13737 (Formerly, extent endpoints at the gap could be either before or | |
13738 after the gap, depending on the open/closedness of the endpoint. | |
13739 The intent of this was to make it so that insertions would | |
13740 automatically go inside or out of extents as necessary with no | |
13741 further work needing to be done. It didn't work out that way, | |
13742 however, and just ended up complexifying and buggifying all the | |
13743 rest of the code.) | |
13744 | |
13745 @node Extent Ordering, Format of the Extent Info, Introduction to Extents, Extents | |
13746 @section Extent Ordering | |
13747 @cindex extent ordering | |
13748 | |
13749 Extents are compared using memory indices. There are two orderings | |
13750 for extents and both orders are kept current at all times. The normal | |
13751 or @dfn{display} order is as follows: | |
13752 | |
13753 @example | |
13754 Extent A is ``less than'' extent B, | |
13755 that is, earlier in the display order, | |
13756 if: A-start < B-start, | |
13757 or if: A-start = B-start, and A-end > B-end | |
13758 @end example | |
13759 | |
13760 So if two extents begin at the same position, the larger of them is the | |
13761 earlier one in the display order (@code{EXTENT_LESS} is true). | |
13762 | |
13763 For the e-order, the same thing holds: | |
13764 | |
13765 @example | |
13766 Extent A is ``less than'' extent B in e-order, | |
13767 that is, earlier in the e-order, | |
13768 if: A-end < B-end, | |
13769 or if: A-end = B-end, and A-start > B-start | |
13770 @end example | |
13771 | |
13772 So if two extents end at the same position, the smaller of them is the | |
13773 earlier one in the e-order (@code{EXTENT_E_LESS} is true). | |
13774 | |
13775 The display order and the e-order are complementary orders: any | |
13776 theorem about the display order also applies to the e-order if you swap | |
13777 all occurrences of ``display order'' and ``e-order'', ``less than'' and | |
13778 ``greater than'', and ``extent start'' and ``extent end''. | |
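
As a concrete reading of the two definitions, here is a minimal sketch
of both comparisons. The structure and function names are hypothetical
(the real code uses the @code{EXTENT_LESS} and @code{EXTENT_E_LESS}
macros mentioned above, whose exact form is not reproduced here); an
extent is reduced to its two memory indices.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified extent: just two memory indices. */
struct extent_span {
    long start;
    long end;
};

/* Display order: earlier start first; on a tie, the longer extent first. */
bool display_less(const struct extent_span *a, const struct extent_span *b)
{
    assert(a->start <= a->end && b->start <= b->end);
    if (a->start != b->start)
        return a->start < b->start;
    return a->end > b->end;
}

/* E-order: earlier end first; on a tie, the shorter extent first. */
bool e_less(const struct extent_span *a, const struct extent_span *b)
{
    assert(a->start <= a->end && b->start <= b->end);
    if (a->end != b->end)
        return a->end < b->end;
    return a->start > b->start;
}
```

Note how @code{e_less} is @code{display_less} with the roles of start and
end exchanged, which is the complementarity described above.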
13779 | |
13780 @node Format of the Extent Info, Zero-Length Extents, Extent Ordering, Extents | |
13781 @section Format of the Extent Info | |
13782 @cindex extent info, format of the | |
13783 | |
13784 An extent-info structure consists of a list of the buffer or string's | |
13785 extents and a @dfn{stack of extents} that lists all of the extents over | |
13786 a particular position. The stack-of-extents info is used for | |
13787 optimization purposes---it basically caches some info that might | |
13788 be expensive to compute. Certain otherwise hard computations are easy | |
13789 given the stack of extents over a particular position, and if the | |
13790 stack of extents over a nearby position is known (because it was | |
13791 calculated at some prior point in time), it's easy to move the stack | |
13792 of extents to the proper position. | |
13793 | |
13794 Given that the stack of extents is an optimization, and given that | |
13795 it requires memory, a string's stack of extents is wiped out each | |
13796 time a garbage collection occurs. Therefore, any time you retrieve | |
13797 the stack of extents, it might not be there. If you need it to | |
13798 be there, use the @code{_force} version. | |
13799 | |
13800 Similarly, a string may or may not have an @code{extent_info} structure. | |
13801 (Generally it won't if there haven't been any extents added to the | |
13802 string.) So use the @code{_force} version if you need the @code{extent_info} | |
13803 structure to be there. | |
13804 | |
13805 A list of extents is maintained as a double gap array. One gap array | |
13806 is ordered by start index (the @dfn{display order}) and the other is | |
13807 ordered by end index (the @dfn{e-order}). Note that positions in an | |
13808 extent list should logically be conceived of as referring @emph{to} a | |
13809 particular extent (as is the norm in programs) rather than sitting | |
13810 between two extents. Note also that callers of these functions should | |
13811 not be aware of the fact that the extent list is implemented as an | |
13812 array, except for the fact that positions are integers (this should be | |
13813 generalized to handle integers and linked lists equally well). | |
13814 | |
13815 A gap array is the same structure used by buffer text: an array of | |
13816 elements with a ``gap'' somewhere in the middle. Insertion and deletion | |
13817 happen by moving the gap to the insertion/deletion point, and then | |
13818 expanding/contracting as necessary. Gap arrays have a number of | |
13819 useful properties: | |
13820 | |
13821 @enumerate | |
13822 @item | |
13823 They are space efficient, as there is no need for next/previous pointers. | |
13824 | |
13825 @item | |
13826 If the items in them are sorted, locating an item is fast: @math{O(log N)}. | |
13827 | |
13828 @item | |
13829 Insertion and deletion are very fast (constant time, essentially) if the | |
13830 gap is near (which favors localized operations, as will usually be the | |
13831 case). Even if not, it requires only a block move of memory, which is | |
13832 generally a highly optimized operation on modern processors. | |
13833 | |
13834 @item | |
13835 Code to manipulate them is relatively simple to write. | |
13836 @end enumerate | |
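
To make the gap-moving idea concrete, here is a minimal sketch of a gap
array of integers. It is not the XEmacs gap array code: the names are
hypothetical, the capacity is fixed so that the expanding/contracting
step can be left out, and a full array simply fails an assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GAP_CAPACITY 16   /* fixed size keeps the sketch short */

struct gap_array_sketch {
    int    items[GAP_CAPACITY];
    size_t gap_start;     /* index of the first free slot */
    size_t gap_end;       /* index one past the last free slot */
};

void gap_init(struct gap_array_sketch *g)
{
    g->gap_start = 0;
    g->gap_end = GAP_CAPACITY;
}

/* Number of stored elements (everything outside the gap). */
size_t gap_length(const struct gap_array_sketch *g)
{
    return GAP_CAPACITY - (g->gap_end - g->gap_start);
}

/* Move the gap so that it begins at logical position POS.  This is the
   block move of memory; it is a no-op when the gap is already there. */
void gap_move(struct gap_array_sketch *g, size_t pos)
{
    assert(pos <= gap_length(g));
    if (pos < g->gap_start) {
        size_t n = g->gap_start - pos;
        memmove(&g->items[g->gap_end - n], &g->items[pos],
                n * sizeof g->items[0]);
        g->gap_start -= n;
        g->gap_end -= n;
    } else if (pos > g->gap_start) {
        size_t n = pos - g->gap_start;
        memmove(&g->items[g->gap_start], &g->items[g->gap_end],
                n * sizeof g->items[0]);
        g->gap_start += n;
        g->gap_end += n;
    }
}

/* Insert VALUE so that it ends up at logical position POS. */
void gap_insert(struct gap_array_sketch *g, size_t pos, int value)
{
    assert(g->gap_start < g->gap_end);   /* sketch: no growth */
    gap_move(g, pos);
    g->items[g->gap_start++] = value;
}

/* Delete the element at logical position POS by widening the gap. */
void gap_delete(struct gap_array_sketch *g, size_t pos)
{
    assert(pos < gap_length(g));
    gap_move(g, pos);
    g->gap_end++;
}

/* Fetch the element at logical position POS, skipping over the gap. */
int gap_get(const struct gap_array_sketch *g, size_t pos)
{
    assert(pos < gap_length(g));
    if (pos >= g->gap_start)
        pos += g->gap_end - g->gap_start;
    return g->items[pos];
}
```

Repeated insertions at the same place never call @code{memmove} after the
first one, which is the locality property claimed in item 3 above.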
13837 | |
13838 An alternative would be balanced binary trees, which have guaranteed | |
13839 @math{O(log N)} time for all operations (although the constant factors | |
13840 are not as good, and repeated localized operations will be slower than | |
13841 for a gap array). Such code is quite tricky to write, however. | |
13842 | |
13843 @node Zero-Length Extents, Mathematics of Extent Ordering, Format of the Extent Info, Extents | |
13844 @section Zero-Length Extents | |
13845 @cindex zero-length extents | |
13846 @cindex extents, zero-length | |
13847 | |
13848 Extents can be zero-length, and will end up that way if their endpoints | |
13849 are explicitly set that way or if their detachable property is @code{nil} | |
13850 and all the text in the extent is deleted. (The exception is open-open | |
13851 zero-length extents, which are barred from existing because there is | |
13852 no sensible way to define their properties. Deletion of the text in | |
13853 an open-open extent causes it to be converted into a closed-open | |
13854 extent.) Zero-length extents are primarily used to represent | |
13855 annotations, and behave as follows: | |
13856 | |
13857 @enumerate | |
13858 @item | |
13859 Insertion at the position of a zero-length extent expands the extent | |
13860 if both endpoints are closed; goes after the extent if it is closed-open; | |
13861 and goes before the extent if it is open-closed. | |
13862 | |
13863 @item | |
13864 Deletion of a character on a side of a zero-length extent whose | |
13865 corresponding endpoint is closed causes the extent to be detached if | |
13866 it is detachable; if the extent is not detachable or the corresponding | |
13867 endpoint is open, the extent remains in the buffer, moving as necessary. | |
13868 @end enumerate | |
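
The insertion rule in item 1 can be written out as a small function.
This is only an illustration of the rule as stated above, with
hypothetical names; it ignores detachability and deals with a single
zero-length extent at the insertion position.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical zero-length extent sitting at the insertion position. */
struct zl_extent {
    long start, end;
    bool start_closed, end_closed;
};

/* Adjust E for an insertion of N characters at the position of E. */
void insert_at_zero_length(struct zl_extent *e, long n)
{
    assert(e->start == e->end);                 /* zero-length only */
    assert(n > 0);
    assert(e->start_closed || e->end_closed);   /* open-open is barred */

    if (e->start_closed && e->end_closed) {
        /* closed-closed: the text goes inside, so the extent expands */
        e->end += n;
    } else if (e->start_closed) {
        /* closed-open: the text goes after the extent, which stays put */
    } else {
        /* open-closed: the text goes before the extent, which moves */
        e->start += n;
        e->end += n;
    }
}
```

The closed-open case leaves the extent where it was, which is the
marker-like behaviour noted below.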
13869 | |
13870 Note that closed-open, non-detachable zero-length extents behave | |
13871 exactly like markers and that open-closed, non-detachable zero-length | |
13872 extents behave like the ``point-type'' marker in Mule. | |
13873 | |
13874 @node Mathematics of Extent Ordering, Extent Fragments, Zero-Length Extents, Extents | |
13875 @section Mathematics of Extent Ordering | |
13876 @cindex mathematics of extent ordering | |
13877 @cindex extent mathematics | |
13878 @cindex extent ordering | |
13879 | |
13880 @cindex display order of extents | |
13881 @cindex extents, display order | |
13882 The extents in a buffer are ordered by ``display order'' because that | |
13883 is the order in which the redisplay mechanism needs to process them. | |
13884 The e-order is an auxiliary ordering used to facilitate operations | |
13885 over extents. The operations that can be performed on the ordered | |
13886 list of extents in a buffer are | |
13887 | |
13888 @enumerate | |
13889 @item | |
13890 Locate where an extent would go if inserted into the list. | |
13891 @item | |
13892 Insert an extent into the list. | |
13893 @item | |
13894 Remove an extent from the list. | |
13895 @item | |
13896 Map over all the extents that overlap a range. | |
13897 @end enumerate | |
13898 | |
13899 (4) requires being able to determine the first and last extents | |
13900 that overlap a range. | |
13901 | |
13902 NOTE: @dfn{overlap} is used as follows: | |
13903 | |
13904 @itemize @bullet | |
13905 @item | |
13906 two ranges overlap if they have at least one point in common. | |
13907 Whether the endpoints are open or closed makes a difference here. | |
13908 @item | |
13909 a point overlaps a range if the point is contained within the | |
13910 range; this is equivalent to treating a point @math{P} as the range | |
13911 @math{[P, P]}. | |
13912 @item | |
13913 In the case of an @emph{extent} overlapping a point or range, the extent | |
13914 is normally treated as having closed endpoints. This applies | |
13915 consistently in the discussion of stacks of extents and such below. | |
13916 Note that this definition of overlap is not necessarily consistent with | |
13917 the extents that @code{map-extents} maps over, since @code{map-extents} | |
13918 sometimes pays attention to whether the endpoints of an extent are open | |
13919 or closed. But for our purposes, it greatly simplifies things to treat | |
13920 all extents as having closed endpoints. | |
13921 @end itemize | |
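
With all endpoints treated as closed, the overlap test used in the rest
of this section reduces to two comparisons. A minimal sketch, with
hypothetical function names:

```c
#include <assert.h>
#include <stdbool.h>

/* Two closed ranges overlap if they have at least one point in common. */
bool ranges_overlap(long a_start, long a_end, long b_start, long b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return a_start <= b_end && b_start <= a_end;
}

/* A point P is treated as the range [P, P]. */
bool point_overlaps(long p, long start, long end)
{
    return ranges_overlap(p, p, start, end);
}
```

Under this definition, ranges that merely touch at a shared endpoint
count as overlapping.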
13922 | |
13923 First, define @math{>}, @math{<}, @math{<=}, etc. as applied to extents | |
13924 to mean comparison according to the display order. Comparison between | |
13925 an extent @math{E} and an index @math{I} means comparison between | |
13926 @math{E} and the range @math{[I, I]}. | |
13927 | |
13928 Also define @math{e>}, @math{e<}, @math{e<=}, etc. to mean comparison | |
13929 according to the e-order. | |
13930 | |
13931 For any range @math{R}, define @math{R(0)} to be the starting index of | |
13932 the range and @math{R(1)} to be the ending index of the range. | |
13933 | |
13934 For any extent @math{E}, define @math{E(next)} to be the extent directly | |
13935 following @math{E}, and @math{E(prev)} to be the extent directly | |
13936 preceding @math{E}. Assume @math{E(next)} and @math{E(prev)} can be | |
13937 determined from @math{E} in constant time. (This is because we store | |
13938 the extent list as a doubly linked list.) | |
13939 | |
13940 Similarly, define @math{E(e-next)} and @math{E(e-prev)} to be the | |
13941 extents directly following and preceding @math{E} in the e-order. | |
13942 | |
13943 Now: | |
13944 | |
13945 Let @math{R} be a range. | |
13946 Let @math{F} be the first extent overlapping @math{R}. | |
13947 Let @math{L} be the last extent overlapping @math{R}. | |
13948 | |
13949 Theorem 1: @math{R(1)} lies between @math{L} and @math{L(next)}, | |
13950 i.e. @math{L <= R(1) < L(next)}. | |
13951 | |
13952 This follows easily from the definition of display order. The | |
13953 basic reason that this theorem applies is that the display order | |
13954 sorts by increasing starting index. | |
13955 | |
13956 Therefore, we can determine @math{L} just by looking at where we would | |
13957 insert @math{R(1)} into the list, and if we know @math{F} and are moving | |
13958 forward over extents, we can easily determine when we've hit @math{L} by | |
13959 comparing the extent we're at to @math{R(1)}. | |
13960 | |
13962 Theorem 2: @math{F(e-prev) e< [1, R(0)] e<= F}. | |
13964 | |
13965 This is the analog of Theorem 1, and applies because the e-order | |
13966 sorts by increasing ending index. | |
13967 | |
13968 Therefore, @math{F} can be found in the same amount of time as | |
13969 operation (1), i.e. the time that it takes to locate where an extent | |
13970 would go if inserted into the e-order list. This is @math{O(log N)}, | |
13971 since we are using gap arrays to manage extents. | |
13972 | |
13973 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents | |
13974 (ordered in display order and e-order, just like for normal extent | |
13975 lists) that overlap an index @math{I}. | |
13976 | |
13977 Now: | |
13978 | |
13979 Let @math{I} be an index, let @math{S} be the stack of extents on | |
13980 @math{I} and let @math{F} be the first extent in @math{S}. | |
13981 | |
13982 Theorem 3: The first extent in @math{S} is the first extent that overlaps | |
13983 any range @math{[I, J]}. | |
13984 | |
13985 Proof: Any extent that overlaps @math{[I, J]} but does not include | |
13986 @math{I} must have a start index @math{> I}, and thus be greater than | |
13987 any extent in @math{S}. | |
13988 | |
13989 Therefore, finding the first extent that overlaps a range @math{R} is | |
13990 the same as finding the first extent that overlaps @math{R(0)}. | |
13991 | |
13992 Theorem 4: Let @math{I2} be an index such that @math{I2 > I}, and let | |
13993 @math{F2} be the first extent that overlaps @math{I2}. Then, either | |
13994 @math{F2} is in @math{S} or @math{F2} is greater than any extent in | |
13995 @math{S}. | |
13996 | |
13997 Proof: If @math{F2} does not include @math{I} then its start index is | |
13998 greater than @math{I} and thus it is greater than any extent in | |
13999 @math{S}, including @math{F}. Otherwise, @math{F2} includes @math{I} | |
14000 and thus is in @math{S}, and thus @math{F2 >= F}. | |
14001 | |
14002 @node Extent Fragments, , Mathematics of Extent Ordering, Extents | |
14003 @section Extent Fragments | |
14004 @cindex extent fragments | |
14005 @cindex fragments, extent | |
14006 | |
14007 Imagine that the buffer is divided up into contiguous, non-overlapping | |
14008 @dfn{runs} of text such that no extent starts or ends within a run | |
14009 (extents that abut the run don't count). | |
14010 | |
14011 An extent fragment is a structure that holds data about the run that | |
14012 contains a particular buffer position (if the buffer position is at the | |
14013 junction of two runs, the run after the position is used)---the | |
14014 beginning and end of the run, a list of all of the extents in that run, | |
14015 the @dfn{merged face} that results from merging all of the faces | |
14016 corresponding to those extents, the begin and end glyphs at the | |
14017 beginning of the run, etc. This is the information that redisplay needs | |
14018 in order to display this run. | |
14019 | |
14020 Extent fragments have to be very quick to update to a new buffer | |
14021 position when moving linearly through the buffer. They rely on the | |
14022 stack-of-extents code, which does the heavy-duty algorithmic work of | |
14023 determining which extents overlie a particular position. | |
14024 | |
14025 @node Faces, Glyphs, Extents, Top | |
14026 @chapter Faces | |
14027 @cindex faces | |
14028 | |
14029 Not yet documented. | |
14030 | |
14031 @node Glyphs, Specifiers, Faces, Top | |
14032 @chapter Glyphs | |
14033 @cindex glyphs | |
14034 | |
14035 Glyphs are graphical elements that can be displayed in XEmacs buffers or | |
14036 gutters. We use the term graphical element here in the broadest possible | |
14037 sense since glyphs can be as mundane as text or as arcane as a native | |
14038 tab widget. | |
14039 | |
14040 In XEmacs, glyphs represent the uninstantiated state of graphical | |
14041 elements, i.e. they hold all the information necessary to produce an | |
14042 image on-screen but the image need not exist at this stage, and multiple | |
14043 screen images can be instantiated from a single glyph. | |
14044 | |
14045 @c #### find a place for this discussion | |
14046 @c The decision to make image specifiers a separate type is debatable. | |
14047 @c In fact, the design decision to create a separate image specifier | |
14048 @c type, rather than make glyphs themselves be specifiers, is | |
14049 @c debatable---the other properties of glyphs are rarely used and could | |
14050 @c conceivably have been incorporated into the glyph's instantiator. | |
14051 @c The rarely used glyph types (buffer, pointer, icon) could also have | |
14052 @c been incorporated into the instantiator. | |
14053 | |
14054 Glyphs are lazily instantiated by calling one of the glyph | |
14055 functions. This usually occurs within redisplay when | |
14056 @code{Fglyph_height} is called. Instantiation causes an image-instance | |
14057 to be created and cached. This cache is on a per-device basis for all glyphs | |
14058 except widget-glyphs, and on a per-window basis for widget-glyphs. The | |
14059 caching is done by @code{image_instantiate} and is necessary because it | |
14060 is generally possible to display an image-instance in multiple | |
14061 domains. For instance, if we create a Pixmap, we can actually display | |
14062 it in multiple windows, even though we only need a single Pixmap | |
14063 instance to do this. If caching were not done, it would be necessary | |
14064 to create image-instances for every displayable occurrence of a glyph, | |
14065 and for every usage, which would be extremely memory- and CPU-intensive. | |
14066 | |
14067 Widget-glyphs (a.k.a. native widgets) are not cached in this way. This is | |
14068 because widget-glyph image-instances on screen are toolkit windows, and | |
14069 thus cannot be reused in multiple XEmacs domains. Thus widget-glyphs are | |
14070 cached on an XEmacs window basis. | |
14071 | |
14072 Any action on a glyph first consults the cache before actually | |
14073 instantiating a widget. | |
14074 | |
14075 @section Glyph Instantiation | |
14076 @cindex glyph instantiation | |
14077 @cindex instantiation, glyph | |
14078 | |
14079 Glyph instantiation is a hairy topic and requires some explanation. The | |
14080 guts of glyph instantiation are contained within | |
14081 @code{image_instantiate}. A glyph contains an image which is a | |
14082 specifier. When a glyph function - for instance @code{Fglyph_height} - | |
14083 asks for a property of the glyph that can only be determined from its | |
14084 instantiated state, then the glyph image is instantiated and an image | |
14085 instance created. The instantiation process is governed by the specifier | |
14086 code and goes through a series of steps: | |
14087 | |
14088 @itemize @bullet | |
14089 @item | |
14090 Validation. Instantiation of image instances happens dynamically - often | |
14091 within the guts of redisplay. Thus it is often not feasible to catch | |
14092 instantiator errors at instantiation time. Instead the instantiator is | |
14093 validated at the time it is added to the image specifier. This check | |
14094 is performed by @code{image_validate}, which at a simple level validates | |
14095 keyword-value pairs. | |
14096 @item | |
14097 Duplication. The specifier code by default takes a copy of the | |
14098 instantiator. This is reasonable for most specifiers but in the case of | |
14099 widget-glyphs can be problematic, since some of the properties in the | |
14100 instantiator - for instance callbacks - could cause infinite recursion | |
14101 in the copying process. Thus the image code defines a function - | |
14102 @code{image_copy_instantiator} - which will selectively copy values. | |
14103 This is controlled by the way that a keyword is defined either using | |
14104 @code{IIFORMAT_VALID_KEYWORD} or | |
14105 @code{IIFORMAT_VALID_NONCOPY_KEYWORD}. Note that the image caching and | |
14106 redisplay code relies on instantiator copying to ensure that current and | |
14107 new instantiators are actually different rather than referring to the | |
14108 same thing. | |
14109 @item | |
14110 Normalization. Once the instantiator has been copied it must be | |
14111 converted into a form that is viable at instantiation time. This can | |
14112 involve no changes at all, but typically involves things like converting | |
14113 file names to the actual data. This is done by | |
14114 @code{image_going_to_add} and @code{normalize_image_instantiator}. | |
14115 @item | |
14116 Instantiation. When an image instance is actually required for display | |
14117 it is instantiated using @code{image_instantiate}. This involves calling | |
14118 instantiate methods that are specific to the type of image being | |
14119 instantiated. | |
14120 @end itemize | |
14121 | |
14122 The final instantiation phase also involves a number of steps. In order | |
14123 to understand these we need to describe a number of concepts. | |
14124 | |
14125 An image is instantiated in a @dfn{domain}, where a domain can be any | |
14126 one of a device, frame, window or image-instance. The domain gives the | |
14127 image-instance context and identity, and properties that affect the | |
14128 appearance of the image-instance may be different for the same glyph | |
14129 instantiated in different domains. An example is the face used to | |
14130 display the image-instance. | |
14131 | |
14132 Although an image is instantiated in a particular domain the | |
14133 instantiation domain is not necessarily the domain in which the | |
14134 image-instance is cached. For example a pixmap can be instantiated in a | |
14135 window but actually be cached on a per-device basis. The domain in which | |
14136 the image-instance is actually cached is called the | |
14137 @dfn{governing-domain}. A governing-domain is currently either a device | |
14138 or a window. Widget-glyphs and text-glyphs have a window as a | |
14139 governing-domain; all other image-instances have a device as the | |
14140 governing-domain. The governing-domain for an image-instance is | |
14141 determined using the @code{governing_domain} image-instance method. | |
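
The rule just stated can be sketched in C. This is only an illustration of
the mapping described above, not code from the XEmacs sources: the enum
names and @code{governing_domain_for} are hypothetical, and the real
@code{governing_domain} method is dispatched per image-instantiator format.

```c
/* Hypothetical, simplified image-instance kinds (not the XEmacs enums). */
enum image_kind { IMAGE_TEXT, IMAGE_PIXMAP, IMAGE_WIDGET };

/* The two kinds of governing-domain named in the text. */
enum governing_domain { GOVERNING_DEVICE, GOVERNING_WINDOW };

/* Widget-glyphs and text-glyphs are cached per window;
   every other image-instance is cached per device. */
enum governing_domain
governing_domain_for (enum image_kind kind)
{
  switch (kind)
    {
    case IMAGE_TEXT:
    case IMAGE_WIDGET:
      return GOVERNING_WINDOW;
    default:
      return GOVERNING_DEVICE;
    }
}
```

Under these assumptions a pixmap instantiated in a window still resolves to
the device cache, which is why one Pixmap can serve several windows.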
14142 | |
14143 @section Widget-Glyphs | |
14144 @cindex widget-glyphs | |
14145 | |
14146 @section Widget-Glyphs in the MS-Windows Environment | |
14147 @cindex widget-glyphs in the MS-Windows environment | |
14148 @cindex MS-Windows environment, widget-glyphs in the | |
14149 | |
14150 To Do | |
14151 | |
14152 @section Widget-Glyphs in the X Environment | |
14153 @cindex widget-glyphs in the X environment | |
14154 @cindex X environment, widget-glyphs in the | |
14155 | |
14156 Widget-glyphs under X make heavy use of lwlib (@pxref{Lucid Widget | |
14157 Library}) for manipulating the native toolkit objects. This is primarily | |
14158 so that different toolkits can be supported for widget-glyphs, just as | |
14159 they are supported for features such as menubars etc. | |
14160 | |
14161 Lwlib is extremely poorly documented and quite hairy so here is my | |
14162 understanding of what goes on. | |
14163 | |
14164 Lwlib maintains a set of @code{widget_instance}s which mirror the hierarchical | |
14165 state of Xt widgets. I think this is so that widgets can be updated and | |
14166 manipulated generically by lwlib. For instance, | |
14167 @code{update_one_widget_instance} can cope with multiple types of widget and | |
14168 multiple types of toolkit. Each element in the widget hierarchy is updated | |
14169 from its corresponding widget_instance by walking the widget_instance | |
14170 tree recursively. | |
14171 | |
14172 This has desirable properties: for example, @code{lw_modify_all_widgets}, | |
14173 called from @file{glyphs-x.c}, updates all the properties of a widget | |
14174 without having to know what the widget is or what toolkit it is from. | |
14175 Unfortunately this also has hairy properties such as making the lwlib | |
14176 code quite complex. And of course lwlib has to know at some level what | |
14177 the widget is and how to set its properties. | |
14178 | |
14179 @node Specifiers, Menus, Glyphs, Top | |
14180 @chapter Specifiers | |
14181 @cindex specifiers | |
14182 | |
14183 Not yet documented. | |
14184 | |
14185 Specifiers are documented in depth in the Lisp Reference manual. | |
14186 @xref{Specifiers,,, lispref, XEmacs Lisp Reference Manual}. The code in | |
14187 @file{specifier.c} is pretty straightforward. | |
14188 | |
14189 @node Menus, Subprocesses, Specifiers, Top | |
14190 @chapter Menus | |
14191 @cindex menus | |
14192 | |
14193 A menu is set by setting the value of the variable | |
14194 @code{current-menubar} (which may be buffer-local) and then calling | |
14195 @code{set-menubar-dirty-flag} to signal a change. This will cause the | |
14196 menu to be redrawn at the next redisplay. The format of the data in | |
14197 @code{current-menubar} is described in @file{menubar.c}. | |
14198 | |
14199 Internally the data in @code{current-menubar} is parsed into a tree of | |
14200 @code{widget_value}s (defined in @file{lwlib.h}); this is accomplished | |
14201 by the recursive function @code{menu_item_descriptor_to_widget_value()}, | |
14202 called by @code{compute_menubar_data()}. Such a tree is deallocated | |
14203 using @code{free_widget_value()}. | |
14204 | |
14205 @code{update_screen_menubars()} is one of the external entry points. | |
14206 This checks to see, for each screen, if that screen's menubar needs to | |
14207 be updated. This is the case if | |
14208 | |
14209 @enumerate | |
14210 @item | |
14211 @code{set-menubar-dirty-flag} was called since the last redisplay. (This | |
14212 function sets the C variable @code{menubar_has_changed}.) | |
14213 @item | |
14214 The buffer displayed in the screen has changed. | |
14215 @item | |
14216 The screen has no menubar currently displayed. | |
14217 @end enumerate | |
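
The three conditions above amount to a simple predicate. The following is a
minimal sketch under stated assumptions: @code{struct screen_menubar_state}
and @code{screen_menubar_needs_update} are hypothetical names invented for
illustration, and the real check in @file{menubar.c} works on screen objects
rather than on a small struct like this one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-screen state; the field names are illustrative only. */
struct screen_menubar_state
{
  const void *displayed_buffer; /* buffer shown at the last menubar update */
  const void *current_buffer;   /* buffer shown in the screen now */
  bool has_menubar;             /* is a menubar currently displayed? */
};

/* True when any of the three conditions listed above holds.
   MENUBAR_HAS_CHANGED stands for the C flag of the same name. */
bool
screen_menubar_needs_update (const struct screen_menubar_state *s,
                             bool menubar_has_changed)
{
  assert (s != NULL);
  return menubar_has_changed
    || s->displayed_buffer != s->current_buffer
    || !s->has_menubar;
}
```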
14218 | |
14219 @code{set_screen_menubar()} is called for each such screen. This | |
14220 function calls @code{compute_menubar_data()} to create the tree of | |
14221 @code{widget_value}s, then calls @code{lw_create_widget()}, | |
14222 @code{lw_modify_all_widgets()}, and/or @code{lw_destroy_all_widgets()} | |
14223 to create the X-Toolkit widget associated with the menu. | |
14224 | |
14225 @code{update_psheets()}, the other external entry point, actually | |
14226 changes the menus being displayed. It uses the widgets fixed by | |
14227 @code{update_screen_menubars()} and calls various X functions to ensure | |
14228 that the menus are displayed properly. | |
14229 | |
14230 The menubar widget is set up so that @code{pre_activate_callback()} is | |
14231 called when the menu is first selected (i.e. mouse button goes down), | |
14232 and @code{menubar_selection_callback()} is called when an item is | |
14233 selected. @code{pre_activate_callback()} calls the function in | |
14234 @code{activate-menubar-hook}, which can change the menubar (this is described | |
14235 in @file{menubar.c}). If the menubar is changed, | |
14236 @code{set_screen_menubars()} is called. | |
14237 @code{menubar_selection_callback()} enqueues a menu event, putting in it | |
14238 a function to call (either @code{eval} or @code{call-interactively}) and | |
14239 its argument, which is the callback function or form given in the menu's | |
14240 description. | |
14241 | |
14242 @node Subprocesses, Interface to MS Windows, Menus, Top | |
14243 @chapter Subprocesses | 15076 @chapter Subprocesses |
14244 @cindex subprocesses | 15077 @cindex subprocesses |
14245 | 15078 |
14246 The fields of a process are: | 15079 The fields of a process are: |
14247 | 15080 |
14848 Auto-generated Unicode encapsulation functions | 15681 Auto-generated Unicode encapsulation functions |
14849 @item intl-auto-encap-win32.h | 15682 @item intl-auto-encap-win32.h |
14850 Auto-generated Unicode encapsulation headers | 15683 Auto-generated Unicode encapsulation headers |
14851 @end table | 15684 @end table |
14852 | 15685 |
14853 @node Interface to the X Window System, Future Work, Interface to MS Windows, Top | 15686 @node Interface to the X Window System, Dumping, Interface to MS Windows, Top |
14854 @chapter Interface to the X Window System | 15687 @chapter Interface to the X Window System |
14855 @cindex X Window System, interface to the | 15688 @cindex X Window System, interface to the |
14856 | 15689 |
14857 Mostly undocumented. | 15690 Mostly undocumented. |
14858 | 15691 |
15144 @file{extw-*} is common code that is used for both the client and server. | 15977 @file{extw-*} is common code that is used for both the client and server. |
15145 | 15978 |
15146 Don't touch this code; something is liable to break if you do. | 15979 Don't touch this code; something is liable to break if you do. |
15147 | 15980 |
15148 | 15981 |
15149 @node Future Work, Future Work Discussion, Interface to the X Window System, Top | 15982 @node Dumping, Future Work, Interface to the X Window System, Top |
15983 @chapter Dumping | |
15984 @cindex dumping | |
15985 | |
15986 @menu | |
15987 * Dumping Justification:: | |
15988 * Overview:: | |
15989 * Data descriptions:: | |
15990 * Dumping phase:: | |
15991 * Reloading phase:: | |
15992 * Remaining issues:: | |
15993 @end menu | |
15994 | |
15995 @node Dumping Justification, Overview, Dumping, Dumping | |
15996 @section Dumping Justification | |
15997 @cindex dumping, justification | |
15998 | |
15999 The C code of XEmacs is just a Lisp engine with a lot of built-in | |
16000 primitives useful for writing an editor. The editor itself is written | |
16001 mostly in Lisp, and represents around 100K lines of code. Loading and | |
16002 executing the initialization of all this code takes a bit of time (five | |
16003 to ten times the usual startup time of current xemacs) and requires | |
16004 having all the lisp source files around. Having to reload them each | |
16005 time the editor is started would not be acceptable. | |
16006 | |
16007 The traditional solution to this problem is called dumping: the build | |
16008 process first creates the lisp engine under the name @file{temacs}, then | |
16009 runs it until it has finished loading and initializing all the lisp | |
16010 code, and eventually creates a new executable called @file{xemacs} | |
16011 including both the object code in @file{temacs} and all the contents of | |
16012 the memory after the initialization. | |
16013 | |
16014 This solution, while working, has a huge problem: the creation of the | |
16015 new executable from the actual contents of memory is an extremely | |
16016 system-specific process that is quite error-prone and interferes with a | |
16017 lot of system libraries (like malloc). It is even getting worse | |
16018 nowadays with libraries using constructors which are automatically | |
16019 called when the program is started (even before @code{main()}) and which tend to | |
16020 crash when they are called multiple times, once before dumping and once | |
16021 after (IRIX 6.x @file{libz.so} pulls in some C++ image libraries through | |
16022 dependencies which have this problem). Writing the dumper is also one | |
16023 of the most difficult parts of porting XEmacs to a new operating system. | |
16024 Basically, `dumping' is an operation that is just not officially | |
16025 supported on many operating systems. | |
16026 | |
16027 The aim of the portable dumper is to solve the same problem as the | |
16028 system-specific dumper, that is to be able to reload quickly, using only | |
16029 a small number of files, the fully initialized lisp part of the editor, | |
16030 without any system-specific hacks. | |
16031 | |
16032 @node Overview, Data descriptions, Dumping Justification, Dumping | |
16033 @section Overview | |
16034 @cindex dumping overview | |
16035 | |
16036 The portable dumping system has to: | |
16037 | |
16038 @enumerate | |
16039 @item | |
16040 At dump time, write all initialized, non-quickly-rebuildable data to a | |
16041 file [Note: currently named @file{xemacs.dmp}, but the name will | |
16042 change], along with all information needed for the reloading. | |
16043 | |
16044 @item | |
16045 When starting xemacs, reload the dump file, relocate it to its new | |
16046 starting address if needed, and reinitialize all pointers to this | |
16047 data. Also, rebuild all the quickly rebuildable data. | |
16048 @end enumerate | |
16049 | |
16050 Note: As of 21.5.18, the dump file has been moved inside of the | |
16051 executable, although there are still problems with this on some systems. | |
16052 | |
16053 @node Data descriptions, Dumping phase, Overview, Dumping | |
16054 @section Data descriptions | |
16055 @cindex dumping data descriptions | |
16056 | |
16057 The most complex task of the dumper is to be able to write memory blocks | |
16058 on the heap (lisp objects, i.e. lrecords, and C-allocated memory, such | |
16059 as structs and arrays) to disk and reload them at a different address, | |
16060 updating all the pointers they include in the process. This is done by | |
16061 using external data descriptions that give information about the layout | |
16062 of the blocks in memory. | |
16063 | |
16064 The specification of these descriptions is in @file{lrecord.h}. A description | |
16065 of an lrecord is an array of @code{struct memory_description}. Each of these | |
16066 structs includes a type, an offset in the block and some optional | |
16067 parameters depending on the type. For instance, here is the string | |
16068 description: | |
16069 | |
16070 @example | |
16071 static const struct memory_description string_description[] = @{ | |
16072 @{ XD_BYTECOUNT, offsetof (Lisp_String, size) @}, | |
16073 @{ XD_OPAQUE_DATA_PTR, offsetof (Lisp_String, data), XD_INDIRECT(0, 1) @}, | |
16074 @{ XD_LISP_OBJECT, offsetof (Lisp_String, plist) @}, | |
16075 @{ XD_END @} | |
16076 @}; | |
16077 @end example | |
16078 | |
16079 The first line indicates a member of type Bytecount, which is used by | |
16080 the next, indirect directive. The second means "there is a pointer to | |
16081 some opaque data in the field @code{data}". The length of said data is | |
16082 given by the expression @code{XD_INDIRECT(0, 1)}, which means "the value | |
16083 in the 0th line of the description (welcome to C) plus one". The third | |
16084 line means "there is a Lisp_Object member @code{plist} in the Lisp_String | |
16085 structure". @code{XD_END} then ends the description. | |
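
To make the indirect length concrete, here is a small self-contained sketch
of how a description line can refer back to an earlier line. Everything in
it is a simplified stand-in invented for illustration (the @code{SK_} enum,
@code{struct sketch_description}, @code{struct sketch_string} and
@code{sketch_opaque_length}); the real definitions in @file{lrecord.h}
encode the indirection differently.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the XD_* description types. */
enum desc_type { SK_END, SK_BYTECOUNT, SK_OPAQUE_DATA_PTR, SK_LISP_OBJECT };

struct sketch_description
{
  enum desc_type type;
  size_t offset;        /* offset of the member within the block */
  int indirect_line;    /* description line holding the length, or -1 */
  long indirect_delta;  /* constant added to that value */
};

struct sketch_string
{
  long size;            /* byte length, excluding the trailing NUL */
  char *data;
  void *plist;
};

static const struct sketch_description sketch_string_description[] = {
  { SK_BYTECOUNT, offsetof (struct sketch_string, size), -1, 0 },
  { SK_OPAQUE_DATA_PTR, offsetof (struct sketch_string, data), 0, 1 },
  { SK_LISP_OBJECT, offsetof (struct sketch_string, plist), -1, 0 },
  { SK_END, 0, -1, 0 }
};

/* Length of the opaque data described on line LINE of DESC for the block
   OBJ: the value stored at the referenced line's offset, plus the delta. */
long
sketch_opaque_length (const void *obj,
                      const struct sketch_description *desc, int line)
{
  const struct sketch_description *d = &desc[line];
  const struct sketch_description *src;

  assert (d->type == SK_OPAQUE_DATA_PTR && d->indirect_line >= 0);
  src = &desc[d->indirect_line];
  assert (src->type == SK_BYTECOUNT);
  return *(const long *) ((const char *) obj + src->offset)
    + d->indirect_delta;
}
```

For a five-byte string the sketch reports six bytes of opaque data, the
extra byte being the terminating NUL that the ``plus one'' accounts for.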
16086 | |
16087 This gives us all the information we need to move around what is pointed | |
16088 to by a memory block (C or lrecord) and, by transitivity, everything | |
16089 that it points to. The only missing information for dumping is the size | |
16090 of the block. For lrecords, this is part of the | |
16091 lrecord_implementation, so we don't need to duplicate it. For C blocks | |
16092 we use a struct sized_memory_description, which includes a size field | |
16093 and a pointer to an associated array of memory_description. | |
16094 | |
16095 @node Dumping phase, Reloading phase, Data descriptions, Dumping | |
16096 @section Dumping phase | |
16097 @cindex dumping phase | |
16098 | |
16099 Dumping is done by calling the function @code{pdump()} (in @file{dumper.c}) which is | |
16100 invoked from @code{Fdump_emacs} (in @file{emacs.c}). This function performs a number | |
16101 of tasks. | |
16102 | |
16103 @menu | |
16104 * Object inventory:: | |
16105 * Address allocation:: | |
16106 * The header:: | |
16107 * Data dumping:: | |
16108 * Pointers dumping:: | |
16109 @end menu | |
16110 | |
16111 @node Object inventory, Address allocation, Dumping phase, Dumping phase | |
16112 @subsection Object inventory | |
16113 @cindex dumping object inventory | |
16114 @cindex memory blocks | |
16115 | |
16116 The first task is to build the list of the objects to dump. This | |
16117 includes: | |
16118 | |
16119 @itemize @bullet | |
16120 @item lisp objects | |
16121 @item other memory blocks (C structures, arrays, etc.) | |
16122 @end itemize | |
16123 | |
16124 We end up with one @code{pdump_block_list_elt} per object group (arrays | |
16125 of C structs are kept together) which includes a pointer to the first | |
16126 object of the group, the per-object size and the count of objects in the | |
16127 group, along with some other information which is initialized later. | |
16128 | |
16129 These entries are linked together in @code{pdump_block_list} structures | |
16130 and can be enumerated thru either: | |
16131 | |
16132 @enumerate | |
16133 @item | |
16134 the @code{pdump_object_table}, an array of @code{pdump_block_list}, one | |
16135 per lrecord type, indexed by type number. | |
16136 | |
16137 @item | |
16138 the @code{pdump_opaque_data_list}, used for the opaque data which does | |
16139 not include pointers, and hence does not need descriptions. | |
16140 | |
16141 @item | |
16142 the @code{pdump_desc_table}, which is a vector of | |
16143 @code{memory_description}/@code{pdump_block_list} pairs, used for | |
16144 non-opaque C memory blocks. | |
16145 @end enumerate | |
16146 | |
16147 The inventory uses a marking strategy similar to the garbage collector's, | |
16148 with some differences: | |
16149 | |
16150 @enumerate | |
16151 @item | |
16152 We do not use the mark bit (which does not exist for generic memory blocks | |
16153 anyway); we use a big hash table instead. | |
16154 | |
16155 @item | |
16156 We do not use the mark function of lrecords but instead rely on the | |
16157 external descriptions. This happens essentially because we need to | |
16158 follow pointers to generic memory blocks and opaque data in addition to | |
16159 Lisp_Object members. | |
16160 @end enumerate | |
16161 | |
16162 This is done by @code{pdump_register_object()}, which handles | |
16163 Lisp_Object variables, and @code{pdump_register_block()} which handles | |
16164 generic memory blocks (C structures, arrays, etc.), which both delegate | |
16165 the description management to @code{pdump_register_sub()}. | |
16166 | |
16167 The hash table doubles as a map from objects to @code{pdump_block_list_elt} (i.e. it | |
16168 allows us to look up a @code{pdump_block_list_elt} given the object it points | |
16169 to). Entries are added with @code{pdump_add_block()} and looked up with | |
16170 @code{pdump_get_block()}. There is no need for entry removal. The hash | |
16171 value is computed quite simply from the object pointer by | |
16172 @code{pdump_make_hash()}. | |
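
An add-only, pointer-keyed table of this kind can be sketched as follows.
This is a hedged illustration, not the code in @file{dumper.c}: the
@code{sketch_} names, the table size, the shift by three bits and the use
of linear probing are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_SIZE 1024  /* illustrative; must exceed the entry count */

/* Hypothetical entry: the object it describes plus its size. */
struct sketch_entry
{
  const void *obj;
  size_t size;
};

static struct sketch_entry *sketch_table[SKETCH_HASH_SIZE];

/* Hash the object address; the low bits are dropped because they are
   mostly zero for aligned objects. */
static size_t
sketch_make_hash (const void *obj)
{
  return (size_t) (((uintptr_t) obj >> 3) % SKETCH_HASH_SIZE);
}

/* Return the entry registered for OBJ, or NULL if there is none. */
struct sketch_entry *
sketch_get_block (const void *obj)
{
  size_t pos = sketch_make_hash (obj);

  while (sketch_table[pos] != NULL)
    {
      if (sketch_table[pos]->obj == obj)
        return sketch_table[pos];
      pos = (pos + 1) % SKETCH_HASH_SIZE;
    }
  return NULL;
}

/* Register E.  Entries are never removed, so probing needs no tombstones.
   The table must never become full, or the loop would not terminate. */
void
sketch_add_block (struct sketch_entry *e)
{
  size_t pos;

  assert (e != NULL && e->obj != NULL);
  pos = sketch_make_hash (e->obj);
  while (sketch_table[pos] != NULL)
    pos = (pos + 1) % SKETCH_HASH_SIZE;
  sketch_table[pos] = e;
}
```

Because nothing is ever removed, a failed lookup can stop at the first
empty slot, which is what lets the table replace the mark bit cheaply.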
16173 | |
16174 The roots for the marking are: | |
16175 | |
16176 @enumerate | |
16177 @item | |
16178 the @code{staticpro}'ed variables (there is a special | |
16179 @code{staticpro_nodump()} call for protected variables we do not want to | |
16180 dump). | |
16181 | |
16182 @item | |
16183 the Lisp_Object variables registered via @code{dump_add_root_lisp_object} | |
16184 (@code{staticpro()} is equivalent to @code{staticpro_nodump()} + | |
16185 @code{dump_add_root_lisp_object()}). | |
16186 | |
16187 @item | |
16188 the data-segment memory blocks registered via @code{dump_add_root_block} | |
16189 (for blocks with relocatable pointers), or @code{dump_add_opaque} (for | |
16190 "opaque" blocks with no relocatable pointers; this is just a shortcut | |
16191 for calling @code{dump_add_root_block} with a NULL description). | |
16192 | |
16193 @item | |
16194 the pointer variables registered via @code{dump_add_root_block_ptr}, | |
16195 each of which points to a block of heap memory (generally a C structure | |
16196 or array). Note that @code{dump_add_root_block_ptr} is not technically | |
16197 necessary, as a pointer variable can be seen as a special case of a | |
16198 data-segment memory block and registered using | |
16199 @code{dump_add_root_block}. Doing it this way, however, would require | |
16200 another level of static structures declared. Since pointer variables | |
16201 are quite common, @code{dump_add_root_block_ptr} is provided for | |
16202 convenience. Note also that internally we have to treat it separately | |
16203 from @code{dump_add_root_block} rather than writing the former as a call | |
16204 to the latter, since we don't have support for creating and using memory | |
16205 descriptions on the fly -- they must all be statically declared in the | |
16206 data-segment. | |
16207 @end enumerate | |
16208 | |
16209 This does not include the GCPRO'ed variables, the specbinds, the | |
16210 catchtags, the backlist, the redisplay or the profiling info, since we | |
16211 do not want to rebuild the actual chain of lisp calls that leads up to | |
16212 the @code{dump-emacs} call, only the global variables. | |
16213 | |
16214 Weak lists and weak hash tables are dumped as if they were their | |
16215 non-weak equivalent (without changing their type, of course). This has | |
16216 not yet been a problem. | |
16217 | |
16218 @node Address allocation, The header, Object inventory, Dumping phase | |
16219 @subsection Address allocation | |
16220 @cindex dumping address allocation | |
16221 | |
16222 | |
16223 The next step is to allocate the offsets of each of the objects in the | |
16224 final dump file. This is done by @code{pdump_allocate_offset()} which | |
16225 is called indirectly by @code{pdump_scan_by_alignment()}. | |
16226 | |
16227 The strategy to deal with alignment problems uses these facts: | |
16228 | |
16229 @enumerate | |
16230 @item | |
16231 real world alignment requirements are powers of two. | |
16232 | |
16233 @item | |
16234 the C compiler is required to adjust the size of a struct so that you | |
16235 can have an array of them next to each other. This means you can get an | |
16236 upper bound on the alignment requirement of a given structure by | |
16237 finding the largest power of two of which its size is a multiple. | |
16238 | |
16239 @item | |
16240 the non-variant part of variable size lrecords has an alignment | |
16241 requirement of 4. | |
16242 @end enumerate | |
16243 | |
16244 Hence, for each lrecord type, C struct type or opaque data block the | |
16245 alignment requirement is computed as a power of two, with a minimum of | |
16246 2^2 for lrecords. @code{pdump_scan_by_alignment()} then scans all the | |
16247 @code{pdump_block_list_elt}s, the ones with the highest requirements | |
16248 first. This ensures the best packing. | |
16249 | |
16250 The maximum alignment requirement we take into account is 2^8. | |
16251 | |
16252 @code{pdump_allocate_offset()} only has to do a linear allocation, | |
16253 starting at offset 256 (this leaves room for the header and keeps the | |
16254 alignments happy). | |
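
The alignment rule and the linear allocation can be sketched together. This
is an illustration under stated assumptions rather than the actual
@code{pdump_allocate_offset()}: the @code{sketch_} names are hypothetical,
and the rounding step is included only so that the sketch stays correct
when a block's size is not a multiple of its alignment.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ALIGN 256    /* 2^8, the largest alignment considered */
#define SKETCH_BASE_OFFSET 256  /* first offset after the header */

/* Largest power of two dividing SIZE, capped at SKETCH_MAX_ALIGN.
   This is an upper bound on the alignment a block of that size needs. */
size_t
sketch_alignment_for_size (size_t size)
{
  size_t align = 1;

  assert (size > 0);
  while (align < SKETCH_MAX_ALIGN && size % (align * 2) == 0)
    align *= 2;
  return align;
}

/* lrecords get a minimum alignment of 2^2. */
size_t
sketch_lrecord_alignment (size_t size)
{
  size_t align = sketch_alignment_for_size (size);

  return align < 4 ? 4 : align;
}

/* Linear allocation: round *CURSOR up to ALIGN (a power of two), reserve
   room for COUNT objects of SIZE bytes, and return the group's offset. */
size_t
sketch_allocate_offset (size_t *cursor, size_t size, size_t count,
                        size_t align)
{
  size_t offset = (*cursor + align - 1) & ~(align - 1);

  *cursor = offset + size * count;
  return offset;
}
```

With these definitions a 24-byte struct gets alignment 8 and a 6-byte
lrecord gets the minimum of 4. Allocating groups in decreasing order of
alignment means the rounding step rarely has to insert any padding.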
16255 | |
16256 @node The header, Data dumping, Address allocation, Dumping phase | |
16257 @subsection The header | |
16258 @cindex dumping, the header | |
16259 | |
16260 The next step creates the file and writes a header with a signature and | |
16261 some miscellaneous information in it. The @code{reloc_address} field, which | |
16262 indicates at which address the file should be loaded if we want to avoid | |
16263 post-reload relocation, is set to 0. It then seeks to offset 256 (base | |
16264 offset for the objects). | |
16265 | |
16266 @node Data dumping, Pointers dumping, The header, Dumping phase | |
16267 @subsection Data dumping | |
16268 @cindex data dumping | |
16269 @cindex dumping, data | |
16270 | |
16271 The data is dumped, in the same order as the addresses were allocated, by | |
16272 @code{pdump_dump_data()}, called from @code{pdump_scan_by_alignment()}. | |
16273 This function copies the data to a temporary buffer, relocates all | |
16274 pointers in the object to the addresses allocated in the address | |
16275 allocation step, and writes it to the file. Using the same order means that, | |
16276 if we are careful with lrecords whose size is not a multiple of 4, we | |
16277 are assured that the object is always written at the offset in the file | |
16278 allocated in the address allocation step. | |
16279 | |
16280 @node Pointers dumping, , Data dumping, Dumping phase | |
16281 @subsection Pointers dumping | |
16282 @cindex pointers dumping | |
16283 @cindex dumping, pointers | |
16284 | |
16285 A number of tables needed to properly reassign the global pointers are | |
16286 then written. They are: | |
16287 | |
16288 @enumerate | |
16289 @item | |
16290 the pdump_root_block_ptrs dynarr | |
16291 @item | |
16292 the pdump_opaques dynarr | |
16293 @item | |
16294 a vector of all the offsets to the objects in the file that include a | |
16295 description (for faster relocation at reload time) | |
16296 @item | |
16297 the pdump_root_objects and pdump_weak_object_chains dynarrs. | |
16298 @end enumerate | |
16299 | |
16300 For each of the dynarrs we write both the pointer to the variables and | |
16301 the relocated offset of the object they point to. Since these variables | |
16302 are global, the pointers are still valid when restarting the program and | |
16303 are used to regenerate the global pointers. | |
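
The pairing of a variable's address with the offset of its object can be
sketched as follows. The names @code{struct sketch_root_ptr} and
@code{sketch_restore_roots} are hypothetical, and the sketch assumes
offsets are counted from the start of the loaded image; the real tables
store more information and handle Lisp objects separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: the address of a global pointer variable and
   the offset, within the dump file, of the object it pointed to. */
struct sketch_root_ptr
{
  void **variable;
  size_t offset;
};

/* At reload time, point each global back at its object inside the loaded
   image, which starts at LOAD_BASE. */
void
sketch_restore_roots (const struct sketch_root_ptr *roots, size_t count,
                      char *load_base)
{
  size_t i;

  assert (roots != NULL || count == 0);
  for (i = 0; i < count; i++)
    *roots[i].variable = load_base + roots[i].offset;
}
```

The scheme depends on the variables being global: their addresses are the
same in the dumping and the reloading process, so only the pointed-to
addresses need recomputing.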
16304 | |
16305 The @code{pdump_weak_object_chains} dynarr is a special case. The | |
16306 variables it points to are the heads of weak linked lists of lisp objects | |
16307 of the same type. Not all objects of this list are dumped so the | |
16308 relocated pointer we associate with them points to the first dumped | |
16309 object of the list, or Qnil if none is available. This is also the | |
16310 reason why they are not used as roots for the purpose of object | |
16311 enumeration. | |
16312 | |
16313 Some very important information like the @code{staticpros} and | |
16314 @code{lrecord_implementations_table} are handled indirectly using | |
16315 @code{dump_add_opaque} or @code{dump_add_root_block_ptr}. | |
16316 | |
16317 This is the end of the dumping part. | |
16318 | |
16319 @node Reloading phase, Remaining issues, Dumping phase, Dumping | |
16320 @section Reloading phase | |
16321 @cindex reloading phase | |
16322 @cindex dumping, reloading phase | |
16323 | |
16324 @subsection File loading | |
16325 @cindex dumping, file loading | |
16326 | |
16327 The file is mmap'ed in memory (which ensures a PAGESIZE alignment, at | |
16328 least 4096) or, if mmap is unavailable or fails, a 256-byte-aligned | |
16329 malloc is done and the file is loaded. | |
16330 | |
16331 Some variables are reinitialized from the values found in the header. | |
16332 | |
16333 The difference between the actual loading address and the @code{reloc_address} | |
16334 is computed and will be used for all the relocations. | |
16335 | |
16336 | |
16337 @subsection Putting back the pdump_opaques | |
16338 @cindex dumping, putting back the pdump_opaques | |
16339 | |
16340 The memory contents are restored in the obvious and trivial way. | |
16341 | |
16342 | |
16343 @subsection Putting back the pdump_root_block_ptrs | |
16344 @cindex dumping, putting back the pdump_root_block_ptrs | |
16345 | |
16346 The variables pointed to by pdump_root_block_ptrs in the dump phase are | |
16347 reset to the right relocated object addresses. | |
16348 | |
16349 | |
16350 @subsection Object relocation | |
16351 @cindex dumping, object relocation | |
16352 | |
16353 All the objects are relocated using their description and their offset | |
16354 by @code{pdump_reloc_one}. This step is unnecessary if the | |
16355 reloc_address is equal to the file loading address. | |
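
The relocation itself is just the addition of one delta to every stored
pointer. Here is a minimal sketch, assuming that null pointers are left
untouched and that the delta is kept as an unsigned value so the arithmetic
wraps cleanly; @code{sketch_relocate_slots} is a hypothetical helper, not
@code{pdump_reloc_one}, which walks the description of each object to find
its pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift COUNT pointer slots by DELTA, where
   DELTA = actual load address - reloc_address, computed modulo 2^N so
   that it also works when the file loads below reloc_address. */
void
sketch_relocate_slots (void **slots, size_t count, uintptr_t delta)
{
  size_t i;

  assert (slots != NULL || count == 0);
  if (delta == 0)
    return;                     /* loaded at reloc_address: nothing to do */
  for (i = 0; i < count; i++)
    if (slots[i] != NULL)       /* null pointers stay null (an assumption) */
      slots[i] = (void *) ((uintptr_t) slots[i] + delta);
}
```

The early return corresponds to the case described above in which the file
is loaded exactly at the expected address, so no object has to be written
to.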
16356 | |
16357 | |
16358 @subsection Putting back the pdump_root_objects and pdump_weak_object_chains | |
16359 @cindex dumping, putting back the pdump_root_objects and pdump_weak_object_chains | |
16360 | |
16361 Same as Putting back the pdump_root_block_ptrs. | |
16362 | |
16363 | |
16364 @subsection Reorganize the hash tables | |
16365 @cindex dumping, reorganize the hash tables | |
16366 | |
16367 Since some of the hash values in the lisp hash tables are | |
16368 address-dependent, their layout is now wrong. So we go through each of | |
16369 them and have them reorganized by calling @code{pdump_reorganize_hash_table}. | |
16370 | |
16371 @node Remaining issues, , Reloading phase, Dumping | |
16372 @section Remaining issues | |
16373 @cindex dumping, remaining issues | |
16374 | |
16375 The build process will have to start a post-dump xemacs, ask it for the | |
16376 loading address (which will, hopefully, always be the same between | |
16377 different xemacs invocations) [[unfortunately, not true on Linux with | |
16378 the ExecShield feature]] and relocate the file to the new address. | |
16379 This way the object relocation phase will not have to be done, which | |
16380 means no writes in the objects and that, because of the use of mmap, the | |
16381 dumped data will be shared between all the xemacs running on the | |
16382 computer. | |
16383 | |
16384 Some executable signature will be necessary to ensure that a given dump | |
16385 file is really associated with a given executable, or random crashes | |
16386 will occur. Maybe a random number set at compile or configure time thru | |
16387 a define. This will also allow for having differently-compiled xemacsen | |
16388 on the same system (mule and no-mule come to mind). | |
16389 | |
16390 The DOC file contents should probably end up in the dump file. | |
16391 | |
16392 | |
16393 @node Future Work, Future Work Discussion, Dumping, Top | |
15150 @chapter Future Work | 16394 @chapter Future Work |
15151 @cindex future work | 16395 @cindex future work |
15152 | 16396 |
15153 @menu | 16397 @menu |
16398 * Future Work -- General Suggestions:: | |
15154 * Future Work -- Elisp Compatibility Package:: | 16399 * Future Work -- Elisp Compatibility Package:: |
15155 * Future Work -- Drag-n-Drop:: | 16400 * Future Work -- Drag-n-Drop:: |
15156 * Future Work -- Standard Interface for Enabling Extensions:: | 16401 * Future Work -- Standard Interface for Enabling Extensions:: |
15157 * Future Work -- Better Initialization File Scheme:: | 16402 * Future Work -- Better Initialization File Scheme:: |
15158 * Future Work -- Keyword Parameters:: | 16403 * Future Work -- Keyword Parameters:: |
15173 * Future Work -- Display Tables:: | 16418 * Future Work -- Display Tables:: |
15174 * Future Work -- Making Elisp Function Calls Faster:: | 16419 * Future Work -- Making Elisp Function Calls Faster:: |
15175 * Future Work -- Lisp Engine Replacement:: | 16420 * Future Work -- Lisp Engine Replacement:: |
15176 @end menu | 16421 @end menu |
15177 | 16422 |
15178 @ignore | 16423 @node Future Work -- General Suggestions, Future Work -- Elisp Compatibility Package, Future Work, Future Work |
15179 Macro to convert a single line containing a heading into the format of | 16424 @section Future Work -- General Suggestions |
15180 all headings in the Future Work section. | 16425 @cindex future work, general suggestions |
15181 | 16426 @cindex general suggestions, future work |
15182 (setq last-kbd-macro (read-kbd-macro | 16427 |
15183 "<S-end> <f3> <home> @node SPC <end> RET @section SPC <f4> <home> <up> <C-right> <right> Future SPC Work SPC - - SPC <home> <down> <C-right> <right> Future SPC Work SPC - - SPC <end> RET @cindex SPC future SPC work, SPC <f4> C-r , RET C-x C-x M-l RET @cindex SPC <f4> <home> <C-right> <S-end> M-l , SPC future SPC work RET")) | 16428 @subheading Jamie Zawinski's XEmacs Wishlist |
15184 @end ignore | 16429 |
15185 | 16430 This document is based on Jamie Zawinski's |
15186 @node Future Work -- Elisp Compatibility Package, Future Work -- Drag-n-Drop, Future Work, Future Work | 16431 @uref{http://www.jwz.org/doc/xemacs-wishlist.html,xemacs wishlist}. |
16432 Throughout this page, ``I'' refers to Jamie. | |
16433 | |
16434 The list has been substantially reformatted and edited to fit the needs | |
16435 of this site. If you have any soul at all, you'll go check out the | |
16436 original. OK? You should also check out some other | |
16437 @uref{http://www.xemacs.org/Releases/Public-21.2/execution.html#wishlists,wishlists}. | |
16438 | |
16439 | |
16440 @subsubheading About the List | |
16441 | |
16442 I've ranked these (roughly) from easiest to hardest; though of all of | |
16443 them, I think the debugger improvements would be the most useful. I think | |
16444 the combination of emacs+gdb is the best Unix development environment | |
16445 currently available, but it's still lamentably primitive and extremely | |
16446 frustrating (much like Unix itself), especially if you know what kinds of | |
16447 features more modern integrated debuggers have. | |
16448 | |
16449 @subsubheading XEmacs Wishlist | |
16450 | |
16451 @table @strong | |
16452 @item Improve the keyboard macro system. | |
16453 | |
16454 Keyboard macros are one of the most useful concepts that emacs has to | |
16455 offer, but there's room for improvement. | |
16456 | |
16457 @table @strong | |
16458 @item Make it possible to embed one macro inside of another. | |
16459 | |
16460 Often, I'll define a keyboard macro, and then realize that I've | |
16461 left something out, or that there's more that I need to do; for | |
16462 example, I may define a macro that does something to the current line, | |
16463 and then realize that I want to apply it to a lot of lines. So, I'd | |
16464 like this to work: | |
16465 | |
16466 @example | |
16467 @kbd{C-x ( } | |
16468 ; start macro #1 | |
16469 @kbd{... } | |
16470 ; (do stuff) | |
16471 @kbd{C-x ) } | |
16472 ; done with macro #1 | |
16473 @kbd{... } | |
16474 ; (do stuff) | |
16475 @kbd{C-x ( } | |
16476 ; start macro #2 | |
16477 @kbd{C-x e } | |
16478 ; execute macro #1 (splice it into macro #2) | |
16479 @kbd{C-s foo } | |
16480 ; move forward to the next spot | |
16481 @kbd{C-x ) } | |
16482 ; done with macro #2 | |
16483 @kbd{C-u 1000 C-x e } | |
16484 ; apply the new macro | |
16485 @end example | |
16486 | |
16487 That is, simply, one should be able to wrap new text around an | |
16488 existing macro. I can't tell you how many times I've defined a complex | |
16489 macro but left out the ``@kbd{C-n C-a}'' at the end... | |
16490 | |
16491 Yes, you can accomplish this with M-x name-last-kbd-macro, but | |
16492 that's a pain. And it's also more permanent than I'd often like. | |
16493 @item Make it possible to correct errors when defining a macro. | |
16494 | |
16495 Right now, the act of defining a macro stops if you get an error | |
16496 while defining it, and all of the characters you've already typed into | |
16497 the macro are gone. It needn't be that way. I think that, when that | |
16498 first error occurs, the user should be given the option of taking the | |
16499 last command off of the macro and trying again. | |
16500 | |
16501 The macro-reader knows where the bounds of multi-character command | |
16502 sequences are, and it could even keep track of the corresponding undo | |
16503 records; rubbing out the previous entry on the macro could also undo | |
16504 any changes that command had made. (This should also work if the macro | |
16505 spans multiple buffers, and should restore window configurations as | |
16506 well.) | |
16507 | |
16508 You'd want multi-level undo for this as well, so maybe the way to | |
16509 go would be to add some new key sequence which was used only as the | |
16510 back-up-inside-a-keyboard-macro-definition command. | |
16511 | |
16512 I'm not totally sure that this would end up being very usable; | |
16513 maybe it would be too hard to deal with. Which brings us to: | |
16514 @item Make it possible to edit a keyboard macro after it has been defined. | |
16515 | |
16516 I only just discovered @code{edit-kbd-macro} (@kbd{C-x C-k}). | |
16517 It is very, very cool. | |
16518 | |
16519 The trick it does of showing the command which will be executed is | |
16520 somewhat error-prone, as it can only look up things in the current map | |
16521 or the global map; if the macro changed buffers, it wouldn't be | |
16522 displaying the right commands. (One of the things I often use macros | |
16523 for is operating on many files at once, by bringing up a dired buffer | |
16524 of those files, editing them, and then moving on to the next.) | |
16525 | |
16526 However, if the act of recording a macro also kept track of the | |
16527 actual commands that had gotten executed, it could make use of that | |
16528 info as well. | |
16529 | |
16530 Another way of editing a macro, other than as text in a buffer, | |
16531 would be to have a command which single-steps a macro: you would lean | |
16532 on the space bar to watch the macro execute one character (command?) | |
16533 at a time, and then when you reached the point you wanted to change, | |
16534 you could do some gesture to either: insert some keystrokes into the | |
16535 middle of the macro and then continue; or to replace the rest of the | |
16536 macro from here to the end; or something. | |
16537 | |
16538 Another similar hack might be to convert a macro to the equivalent | |
16539 lisp code, so that one could tweak it later in ways that would be too | |
16540 hard to do from the keyboard (wrapping parts of it in @code{while} loops or | |
16541 something.) (@kbd{M-x insert-kbd-macro} isn't really what I'm | |
16542 talking about here: I mean insert the list of commands, not the list | |
16543 of keystrokes.) | |
16544 @end table | |
16545 | |
16546 @item Save my wrists! | |
16547 | |
16548 In the spirit of the `@code{teach-extended-commands-p}' variable, | |
16549 it would be interesting if emacs would keep track of what are the | |
16550 commands I use most often, perhaps grouped by proximity or mode -- it | |
16551 would then be more obvious which commands were most likely candidates | |
16552 for placement on a toolbar, or popup menu, or just a more convenient key | |
16553 binding. | |
16554 | |
16555 Bonus points if it figures out that I type ``@kbd{bt\n}'' and | |
16556 ``@kbd{ret\ny\n}'' into my @samp{*gdb*} buffer about a hundred | |
16557 thousand times a day. | |
16558 @item XmCreateFileSelectionBox | |
16559 | |
16560 The thing that ``File/Open...'' pops up has excellent @emph{hack} | |
16561 value, but as a user interface, it's an abomination. Isn't it time | |
16562 someone added a real file selection dialog already? (For the | |
16563 Motifly-challenged, the Athena-based file selector that GhostView uses | |
16564 seems adequate.) | |
16565 @item Improve the toolbar system. | |
16566 | |
16567 It's great that XEmacs has a toolbar, but it's damn near impossible | |
16568 to customize it. | |
16569 | |
16570 @table @strong | |
16571 @item Make it easy to define new toolbar buttons. | |
16572 | |
16573 Currently, to define a toolbar button that has a text equivalent, | |
16574 one must edit a pixmap, and put the text there! That's prohibitive. | |
16575 One should be able to add some kind of generic toolbar button, with a | |
16576 plain icon or none at all, but which has a text label, without having | |
16577 to use a paint program. | |
16578 @item Make it easy to have customized, mode-local toolbars. | |
16579 | |
16580 In my @code{c-mode-hook}, for example, I can add a couple of new | |
16581 keybindings, and delete a few others, and to do that, I don't have to | |
16582 duplicate the entire definition of the @code{c-mode-map}. Making | |
16583 mode-local additions and subtractions to the toolbars should be as | |
16584 easy. | |
16585 @item Make it easy to have customized, mode-local popup menus. | |
16586 | |
16587 The same situation holds for the right-mouse-button popup menu; one | |
16588 should be able to add new commands to those menus without difficulty. | |
16589 One problem is that each mode which does have a popup menu implements | |
16590 it in a different way... | |
16591 @end table | |
16592 | |
16593 @item Make the External Widget work. | |
16594 | |
16595 About half of the work is done to make a replacement for the | |
16596 @code{XmText} widget which offloads editing responsibility to an | |
16597 external Emacs process. Someone should finish that. The benefit here | |
16598 would be that then, any Motif program could be linked such that all | |
16599 editing happened with a real Emacs behind it. (If you're Athena-minded, | |
16600 flavor with @code{Text} instead of @code{XmText} -- it's probably | |
16601 easy to make it work with both.) | |
16602 | |
16603 The part of this that is done already is the ability to run an Emacs | |
16604 screen on a Window object that has been created by another process (this | |
16605 is what the @file{ExternalClient.c} and @file{ExternalShell.c} stuff | |
16606 is.) What is left to be done is, adding the text-widget-editor aspects | |
16607 of this. | |
16608 | |
16609 First, the emacs screen being displayed on that window would have to | |
16610 be one without a modeline, and one which behaved sensibly in the context | |
16611 of ``I am a small multi-line text area embedded in a dialog box'' as | |
16612 opposed to ``I am a full-on text editor and lord of all that I survey.'' | |
16613 | |
16614 Second, the API that the (non-emacs-aware) user of the | |
16615 @code{XmText} widget expects would need to be implemented: give the | |
16616 caller the ability to pull the edited text string back out, and so on. | |
16617 The idea here being, hooking up emacs as the widget editor should be as | |
16618 transparent as possible. | |
16619 @item Bring the debugger interface into the eighties. | |
16620 | |
16621 Some of you may have seen my @file{gdb-highlight.el} | |
16622 package, that I posted to gnu.emacs.sources last month. I think | |
16623 it's really cool, but there should be a lot more work in that direction. | |
16624 For those of you who haven't seen it, what it does is watch text that | |
16625 gets inserted into the @samp{*gdb*} buffer and make very nearly | |
16626 everything be clickable and have a context-sensitive menu. Generally, | |
16627 the types that are noticed are: | |
16628 | |
16629 @itemize | |
16630 @item function names; | |
16631 @item variable and parameter names; | |
16632 @item structure slots; | |
16633 @item source file names; | |
16634 @item type names; | |
16635 @item breakpoint numbers; | |
16636 @item stack frame numbers. | |
16637 @end itemize | |
16638 | |
16639 Any time one of those objects is presented in the @samp{*gdb*} | |
16640 buffer, it is mousable. Clicking middle button on it takes some default | |
16641 action (edits the function, selects the stack frame, disables the | |
16642 breakpoint, ...) Clicking the right button pops up a menu of commands, | |
16643 including commands specific to the object under the mouse, and/or other | |
16644 objects on the same line. | |
16645 | |
16646 So that's all well and good, and I get far more joy out of what this | |
16647 code does for me than I expected, but there are still a bunch of | |
16648 limitations. The debugger interface needs to do much, much more. | |
16649 | |
16650 @table @strong | |
16651 @item Make gdbsrc-mode not suck. | |
16652 | |
16653 The idea behind @code{gdbsrc-mode} is on the side of the angels: | |
16654 one should be able to focus on the source code and not on the debugger | |
16655 buffer, absolutely. But the implementation is just awful. | |
16656 | |
16657 First and foremost, it should not change ``modes'' (in the more | |
16658 general sense). Any commands that it defines should be on keys which | |
16659 are exclusively used for that purpose, not keys which are normally | |
16660 self-inserting. I can't be the only person who usually has occasion to | |
16661 actually @emph{edit} the sources which the debugger has chosen to | |
16662 display! Switching into and out of @code{gdbsrc-mode} is | |
16663 prohibitive. | |
16664 | |
16665 I want to be looking at my sources at all times, yet I don't want | |
16666 to have to give up my source-editing gestures. I think the right way | |
16667 to accomplish this is to put the gdbsrc commands on the toolbar and on | |
16668 popup menus; or to let the user define their own keys (I could see | |
16669 devoting my @key{kp_enter} key to ``step'', or something common | |
16670 like that.) | |
16671 | |
16672 Also it's extremely frustrating that one can't turn off gdbsrc mode | |
16673 once it has been loaded, without exiting and restarting emacs; that | |
16674 alone means that I'd probably never take the time to learn how to use | |
16675 it, without first having taken the time to repair it... | |
16676 @item Make it easier to access variable values. | |
16677 | |
16678 I want to be able to double-click on a variable name to highlight | |
16679 it, and then drag it to the debugger window to have its value printed. | |
16680 | |
16681 I want gestures that let me write as well as read: for example, to | |
16682 store value A into slot B. | |
16683 @item Make all breakpoints visible. | |
16684 | |
16685 Any time there is a running gdb which has breakpoints, the buffers | |
16686 holding the lines on which those breakpoints are set should have icons | |
16687 in them. These icons should be context-sensitive: I should be able to | |
16688 pop up a menu to enable or disable them, to delete them, to change | |
16689 their commands or conditions. | |
16690 | |
16691 I should also be able to @emph{move} them. It's | |
16692 annoying when you have a breakpoint with a complex condition or | |
16693 command on it, and then you realize that you really want it to be at a | |
16694 different location. I want to be able to drag-and-drop the icon to its | |
16695 new home. | |
16696 @item Make a debugger status display window. | |
16697 | |
16698 @itemize | |
16699 @item | |
16700 | |
16701 I want a window off to the side that shows persistent information | |
16702 -- it should have a pane which is a drag-editable, drag-reorderable | |
16703 representation of the elements on gdb's ``display'' list; they | |
16704 should be displayed here instead of being just dumped in with the | |
16705 rest of the output in the @samp{*gdb*} buffer. | |
16706 @item | |
16707 | |
16708 I want a pane that displays the current call-stack and nothing | |
16709 else. I want a pane that displays the arguments and locals of the | |
16710 currently-selected frame and nothing else. I want these both to | |
16711 update as I move around on the stack. | |
16712 @item | |
16713 | |
16714 Since the unfortunate reality is that excavating this information | |
16715 from gdb can be slow, it would be a good idea for these panes to | |
16716 have a toggle button on them which meant ``stop updating'', so that | |
16717 when I want to move fast, I can, but I can easily get the display | |
16718 back when I need it again. | |
16719 @end itemize | |
16720 | |
16721 The reason for all of this is that I spend entirely too much time | |
16722 scrolling around in the @samp{*gdb*} buffer; with gdb-highlight, I | |
16723 can just click on a line in the backtrace output to go to that frame, | |
16724 but I find that I spend a lot of time @emph{looking} for that | |
16725 backtrace: since it's mixed in with all the other random output, I | |
16726 waste time looking around for things (and usually just give up and | |
16727 type ``@kbd{bt}'' again, then thrash around as the buffer scrolls, | |
16728 and I try to find the lower frames that I'm interested in, as they | |
16729 have invariably scrolled off the window already...) | |
16730 @item Save and restore breakpoints across emacs/debugger sessions. | |
16731 | |
16732 This would be especially handy given that gdb leaks like a sieve, | |
16733 and with a big program, I only get a few dozen relink-and-rerun | |
16734 attempts before gdb has blown my swap space. | |
16735 @item Keep breakpoints in sync with source lines. | |
16736 | |
16737 When a program is recompiled and then reloaded into gdb, the | |
16738 breakpoints often end up in less-than-useful places. For example, when | |
16739 I edit text which occurs in a file anywhere before a breakpoint, emacs | |
16740 is aware that the line of the bp hasn't changed, but just that it is | |
16741 in a different place relative to the top of the file. Gdb doesn't know | |
16742 this, so your breakpoints end up getting set in the wrong places | |
16743 (usually the maximally inconvenient places, like @emph{after} a loop | |
16744 instead of @emph{inside} it). But emacs knows, so emacs should | |
16745 inform the debugger, and move the breakpoints back to the places they | |
16746 were intended to be. | |
16747 @end table | |
16748 | |
16749 (Possibly the OOBR stuff does some of this, but I can't tell, because | |
16750 I've never been able to get it to do anything but beep at me and mumble | |
16751 about environments. I find it pretty funny that the manual keeps | |
16752 explaining to me how intuitive it is, without actually giving me a clue | |
16753 how to launch it...) | |
16754 @item Add better dialog box features. | |
16755 | |
16756 It'd be nice to be able to create more complex dialog boxes from | |
16757 emacs-lisp: ones with checkboxes, radio button groups, text fields, and | |
16758 popup menus. | |
16759 @item Add embeddable dialog boxes. | |
16760 | |
16761 One of the things that the now-defunct Energize code (the C side of | |
16762 it, that is) could do was embed a dialog box between the toolbar and the | |
16763 main text area -- buffers could have control panels associated with | |
16764 them, that had all kinds of complex behavior. | |
16765 @item Make the mark-stack be visible. | |
16766 | |
16767 You know, I've encountered people who have been using emacs for | |
16768 years, and never use the mark stack for navigation. I can't live without | |
16769 it; ``@kbd{C-u C-SPC}'' is among my most common gestures. | |
16770 | |
16771 @enumerate | |
16772 @item | |
16773 | |
16774 It would be a lot easier to realize what's going to happen if the | |
16775 marks on the mark stack were visible. They could be displayed as small | |
16776 ``caret'' glyphs, for example; something large enough to be visible, | |
16777 but not easily mistaken for a character or for the cursor. | |
16778 @item | |
16779 | |
16780 The marks and the selected region should be visible in the | |
16781 scrollbar as well -- I don't remember where I first saw this idea, but | |
16782 it's very cool: there's a second, less-strongly-rendered ``thumb'' in | |
16783 the scrollbar which indicates the position and size of the selection; | |
16784 and there are tiny tick-marks which indicate the positions of the | |
16785 saved points. | |
16786 @item | |
16787 | |
16788 Markers which are in registers (@code{point-to-register}, @kbd{C-x | |
16789 /}) should be displayed differently (more prominently). | |
16790 @item | |
16791 | |
16792 It'd be cool if you could pick up markers and move them around, to | |
16793 adjust the points you'll be coming back to later. | |
16794 @end enumerate | |
16795 | |
16796 @item Write a new garbage collector. | |
16797 | |
16798 The emacs GC is very primitive; it is also, fortunately, a | |
16799 rather well isolated module, and it would not be a very big task to swap | |
16800 it with a new one (once that new one was written, that is.) Someone | |
16801 should go bone up on modern GC techniques, and then just dive right | |
16802 in... | |
16803 @item Add support for lexical scope to the emacs-lisp runtime. | |
16804 | |
16805 Yadda yadda, this list goes to eleven. | |
16806 @end table | |
16807 | |
16808 @* | |
16809 Subject: | |
16810 @strong{Re: XEmacs wishlist} | |
16811 Date: Wed, 14 May 1997 16:18:23 -0700 | |
16812 From: Jamie Zawinski <jwz@@netscape.com> | |
16813 Newsgroups: comp.emacs.xemacs, comp.emacs | |
16814 | |
16815 Andreas Schwab wrote: | |
16816 | |
16817 @quotation | |
16818 @emph{Use `C-u C-x (': } | |
16819 | |
16820 @emph{start-kbd-macro:@*Non-nil arg (prefix arg) means append to last | |
16821 macro defined; This begins by re-executing that macro as if you typed it | |
16822 again. } | |
16823 @end quotation | |
16824 | |
16825 Cool, I didn't know it did that... | |
16826 | |
16827 But it only lets you append. I often want to prepend, or embed the | |
16828 macro multiple times (motion 1, C-x e, motion 2, C-x e, motion 3.) | |
16829 | |
16830 @subheading 21.2 Showstoppers | |
16831 | |
16832 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16833 | |
16834 I. DISTRIBUTION ISSUES | |
16835 | |
16836 A. Unified Source Tarball. | |
16837 | |
16838 Packages go under root/lib/xemacs/xemacs-packages and no one ever has | |
16839 to mess with --package-path and the result can be moved from one | |
16840 directory to another pre- or post-install. | |
16841 | |
16842 | |
16843 B. Unified Binary Tarballs with Packages. | |
16844 | |
16845 Same principles as above. | |
16846 | |
16847 If people complain, we can also provide split binary tarballs | |
16848 (architecture dependent and independent) and place these files in a | |
16849 subdirectory so as not to confuse the majority just looking for one | |
16850 tarball. | |
16851 | |
16852 Under Windows, we need to provide a WISE-style GUI setup program. It's | |
16853 already there but needs some work so you can select "all" packages | |
16854 easily (should be the default). | |
16855 | |
16856 C. Parallel Root and Package Trees. | |
16857 | |
16858 If the user downloads the main source and the packages separately, he | |
16859 will naturally untar them into the same directory. This results in the | |
16860 parallel root and package structure. We should support this as a "last | |
16861 resort," i.e., if we find no packages anywhere and are about to resign | |
16862 ourselves to not having packages, then look for a parallel package | |
16863 tree. The user who sets things up like this should be able to either | |
16864 run in place or "make install" and get a proper installed | |
16865 XEmacs. Never should the user have to touch --package-path. | |
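
The "last resort" ordering described above can be sketched in C as follows. This is only an illustration under stated assumptions: `find_package_root`, `has_packages_fn` and `demo_has_packages` are hypothetical names, the directory strings used with them are made up, and the real check for whether a directory holds packages is left to a caller-supplied predicate. The standard locations are tried first, and the parallel tree is consulted only when none of them has packages.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Caller-supplied predicate: nonzero if `dir` looks like a package tree. */
typedef int (*has_packages_fn)(const char *dir);

/* Try the standard locations in order.  Only when none of them contains
   packages fall back to the parallel tree (may be NULL if there is none).
   Returns the chosen directory, or NULL if no packages were found. */
static const char *
find_package_root(const char *const *standard_dirs, size_t n_standard,
                  const char *parallel_dir, has_packages_fn has_packages)
{
    assert(has_packages != NULL);
    for (size_t i = 0; i < n_standard; i++)
        if (has_packages(standard_dirs[i]))
            return standard_dirs[i];
    if (parallel_dir != NULL && has_packages(parallel_dir))
        return parallel_dir;
    return NULL;
}

/* Demo predicate standing in for a real directory check: it accepts any
   path that ends in "xemacs-packages". */
static int demo_has_packages(const char *dir)
{
    const char *suffix = "xemacs-packages";
    size_t n = strlen(dir), m = strlen(suffix);
    return n >= m && strcmp(dir + n - m, suffix) == 0;
}
```

Because the parallel tree is checked last, an installed package hierarchy always takes precedence, and the user never has to pass --package-path to get either layout to work.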
16866 | |
16867 II. WINDOWS PRINTING | |
16868 | |
16869 Looks like the internals are done but not the GUI. This must be | |
16870 working in 21.2. | |
16871 | |
16872 III. WINDOWS MULE | |
16873 | |
16874 Basic support should be there. There's already a patch to get things | |
16875 started and I'll be doing more work to make this real. | |
16876 | |
16877 IV. GUTTER ETC. | |
16878 | |
16879 This stuff needs to be "stable" and generally free from bugs. Any | |
16880 APIs we create need to be well-reviewed or marked clearly as | |
16881 experimental. | |
16882 | |
16883 V. PORTABLE DUMPER | |
16884 | |
16885 Last bits need to be cleaned up. This should be made the "default" for | |
16886 a while to flush out problems. Under Microsoft Windows, Portable | |
16887 Dumper must be the default in 21.2 because of the problems with the | |
16888 existing dump process. | |
16889 | |
16890 COMMENT: I'd like to feature freeze this pretty soon and create a 21.3 | |
16891 tree where all of my major overhauls of Mule-related stuff will go | |
16892 in. At around the same time, we need to do the move-around in the | |
16893 repository (or create a new one) and "upgrade" to the latest CVS | |
16894 server. | |
16895 | |
16896 @node Future Work -- Elisp Compatibility Package, Future Work -- Drag-n-Drop, Future Work -- General Suggestions, Future Work | |
15187 @section Future Work -- Elisp Compatibility Package | 16897 @section Future Work -- Elisp Compatibility Package |
15188 @cindex future work, elisp compatibility package | 16898 @cindex future work, elisp compatibility package |
15189 @cindex elisp compatibility package, future work | 16899 @cindex elisp compatibility package, future work |
16900 | |
16901 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
15190 | 16902 |
15191 A while ago I created a package called Sysdep, which aimed to be a | 16903 A while ago I created a package called Sysdep, which aimed to be a |
15192 forward compatibility package for Elisp. The idea was that instead of | 16904 forward compatibility package for Elisp. The idea was that instead of |
15193 having to write your package using the oldest version of Emacs that you | 16905 having to write your package using the oldest version of Emacs that you |
15194 wanted to support, you could use the newest XEmacs API, and then simply | 16906 wanted to support, you could use the newest XEmacs API, and then simply |
15320 where a function is called using @code{funcall} or @code{apply}. | 17032 where a function is called using @code{funcall} or @code{apply}. |
15321 However, such uses of functions would not be affected by the surrounding | 17033 However, such uses of functions would not be affected by the surrounding |
15322 macrolet call, and so there doesn't appear to be any point in extracting | 17034 macrolet call, and so there doesn't appear to be any point in extracting |
15323 them). | 17035 them). |
15324 | 17036 |
15325 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
15326 | |
15327 @node Future Work -- Drag-n-Drop, Future Work -- Standard Interface for Enabling Extensions, Future Work -- Elisp Compatibility Package, Future Work | 17037 @node Future Work -- Drag-n-Drop, Future Work -- Standard Interface for Enabling Extensions, Future Work -- Elisp Compatibility Package, Future Work |
15328 @section Future Work -- Drag-n-Drop | 17038 @section Future Work -- Drag-n-Drop |
15329 @cindex future work, drag-n-drop | 17039 @cindex future work, drag-n-drop |
15330 @cindex drag-n-drop, future work | 17040 @cindex drag-n-drop, future work |
17041 | |
17042 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
15331 | 17043 |
15332 @strong{Abstract:} I propose completely redoing the drag-n-drop | 17044 @strong{Abstract:} I propose completely redoing the drag-n-drop |
15333 interface to make it powerful and extensible enough to support such | 17045 interface to make it powerful and extensible enough to support such |
15334 concepts as drag over and drag under visuals and context menus invoked | 17046 concepts as drag over and drag under visuals and context menus invoked |
15335 when a drag is done with the right mouse button, to allow drop handlers | 17047 when a drag is done with the right mouse button, to allow drop handlers |
15437 drop, etc. This event is always passed to any function that is invoked | 17149 drop, etc. This event is always passed to any function that is invoked |
15438 as a result of the drag or drop. There should never be any need to | 17150 as a result of the drag or drop. There should never be any need to |
15439 refer to the @code{current-mouse-event} variable, and in fact, this | 17151 refer to the @code{current-mouse-event} variable, and in fact, this |
15440 variable should not be changed at all during a drag or a drop. | 17152 variable should not be changed at all during a drag or a drop. |
15441 | 17153 |
15442 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
15443 | |
15444 @node Future Work -- Standard Interface for Enabling Extensions, Future Work -- Better Initialization File Scheme, Future Work -- Drag-n-Drop, Future Work | 17154 @node Future Work -- Standard Interface for Enabling Extensions, Future Work -- Better Initialization File Scheme, Future Work -- Drag-n-Drop, Future Work |
15445 @section Future Work -- Standard Interface for Enabling Extensions | 17155 @section Future Work -- Standard Interface for Enabling Extensions |
15446 @cindex future work, standard interface for enabling extensions | 17156 @cindex future work, standard interface for enabling extensions |
15447 @cindex standard interface for enabling extensions, future work | 17157 @cindex standard interface for enabling extensions, future work |
17158 | |
17159 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
15448 | 17160 |
15449 @strong{Abstract:} Apparently, if you know the name of a package (for | 17161 @strong{Abstract:} Apparently, if you know the name of a package (for |
15450 example, @code{fusion}), you can load it using the @code{require} | 17162 example, @code{fusion}), you can load it using the @code{require} |
15451 function, but there's no standard way to turn it on or turn it off. The | 17163 function, but there's no standard way to turn it on or turn it off. The |
15452 only way to figure out how to do that is to go read the source file, | 17164 only way to figure out how to do that is to go read the source file, |
15518 extensions and a judgment on first of all, how commonly a user might | 17230 extensions and a judgment on first of all, how commonly a user might |
15519 want this extension, and second of all, how well written and bug-free | 17231 want this extension, and second of all, how well written and bug-free |
15520 the package is. Both of these sorts of judgments could be obtained by | 17232 the package is. Both of these sorts of judgments could be obtained by |
15521 doing user surveys if need be. | 17233 doing user surveys if need be. |
15522 | 17234 |
15523 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
15524 | |
15525 @node Future Work -- Better Initialization File Scheme, Future Work -- Keyword Parameters, Future Work -- Standard Interface for Enabling Extensions, Future Work | 17235 @node Future Work -- Better Initialization File Scheme, Future Work -- Keyword Parameters, Future Work -- Standard Interface for Enabling Extensions, Future Work |
15526 @section Future Work -- Better Initialization File Scheme | 17236 @section Future Work -- Better Initialization File Scheme |
15527 @cindex future work, better initialization file scheme | 17237 @cindex future work, better initialization file scheme |
15528 @cindex better initialization file scheme, future work | 17238 @cindex better initialization file scheme, future work |
17239 | |
17240 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
15529 | 17241 |
15530 @strong{Abstract:} A proposal is outlined for converting XEmacs to use | 17242 @strong{Abstract:} A proposal is outlined for converting XEmacs to use |
15531 the @code{.xemacs} subdirectory for its initialization files instead of | 17243 the @code{.xemacs} subdirectory for its initialization files instead of |
15532 putting them in the user's home directory. In the process, a general | 17244 putting them in the user's home directory. In the process, a general |
15533 pre-initialization scheme is created whereby all of the initialization | 17245 pre-initialization scheme is created whereby all of the initialization |
15625 @code{init.el} or @code{pre-init.el}, or if neither of those files is | 17337 @code{init.el} or @code{pre-init.el}, or if neither of those files is |
15626 present, then it doesn't contain any sub-directories or files that look | 17338 present, then it doesn't contain any sub-directories or files that look |
15627 like what would be in a package root), then it becomes the value of the | 17339 like what would be in a package root), then it becomes the value of the |
15628 init file directory. Otherwise the user's home directory is used. | 17340 init file directory. Otherwise the user's home directory is used. |
15629 @item | 17341 @item |
15630 | |
15631 | 17342 |
15632 If the init file directory is the user's home directory, then the init | 17343 If the init file directory is the user's home directory, then the init |
15633 file is called @code{.emacs}. Otherwise, it's called @code{init.el}. | 17344 file is called @code{.emacs}. Otherwise, it's called @code{init.el}. |
15634 @item | 17345 @item |
15635 | |
15636 | 17346 |
15637 If the init file directory is the user's home directory, then the | 17347 If the init file directory is the user's home directory, then the |
15638 pre-init file is called @code{.xemacs-pre-init.el}. Otherwise it's | 17348 pre-init file is called @code{.xemacs-pre-init.el}. Otherwise it's |
15639 called @code{pre-init.el}. (One of the reasons for this rule has to do | 17349 called @code{pre-init.el}. (One of the reasons for this rule has to do |
15640 with the dialog box that might be displayed at startup. This will be | 17350 with the dialog box that might be displayed at startup. This will be |
15641 described below.) | 17351 described below.) |
15642 @item | 17352 @item |
15643 | |
15644 | 17353 |
15645 If the init file directory is the user's home directory, then the custom | 17354 If the init file directory is the user's home directory, then the custom |
15646 init file is called @code{.xemacs-custom-init.el}. Otherwise, it's | 17355 init file is called @code{.xemacs-custom-init.el}. Otherwise, it's |
15647 called @code{custom-init.el}. | 17356 called @code{custom-init.el}. |
15648 | 17357 |
15712 | 17421 |
15713 If an error occurs in the init file, then the initial frame should | 17422 If an error occurs in the init file, then the initial frame should |
15714 always be created and mapped at that time so that the error is displayed | 17423 always be created and mapped at that time so that the error is displayed |
15715 and the debugger has a place to be invoked. | 17424 and the debugger has a place to be invoked. |
15716 | 17425 |
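The file-naming rules given earlier in this section can be sketched as
follows, assuming the init file directory has already been chosen.  The
function name @code{sketch-init-file-names} and the plist it returns are
hypothetical and not part of XEmacs; only the file names themselves come
from the proposal.

@example
;; Hypothetical helper, for illustration only.
(defun sketch-init-file-names (init-directory home-directory)
  "Return a plist of init file names to use in INIT-DIRECTORY.
The dotted names are used only when INIT-DIRECTORY is HOME-DIRECTORY."
  (let* ((init-dir (file-name-as-directory
                    (expand-file-name init-directory)))
         (home-dir (file-name-as-directory
                    (expand-file-name home-directory)))
         ;; Non-nil when the init file directory is the home directory.
         (in-home (string= init-dir home-dir)))
    (list 'init
          (expand-file-name (if in-home ".emacs" "init.el") init-dir)
          'pre-init
          (expand-file-name
           (if in-home ".xemacs-pre-init.el" "pre-init.el") init-dir)
          'custom-init
          (expand-file-name
           (if in-home ".xemacs-custom-init.el" "custom-init.el")
           init-dir))))
@end example

Calling this with a @file{.xemacs} subdirectory as the first argument
selects the undotted names; calling it with the home directory for both
arguments selects @file{.emacs} and the @file{.xemacs-} prefixed names.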
15717 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
15718 | |
15719 @node Future Work -- Keyword Parameters, Future Work -- Property Interface Changes, Future Work -- Better Initialization File Scheme, Future Work | 17426 @node Future Work -- Keyword Parameters, Future Work -- Property Interface Changes, Future Work -- Better Initialization File Scheme, Future Work |
15720 @section Future Work -- Keyword Parameters | 17427 @section Future Work -- Keyword Parameters |
15721 @cindex future work, keyword parameters | 17428 @cindex future work, keyword parameters |
15722 @cindex keyword parameters, future work | 17429 @cindex keyword parameters, future work |
17430 | |
17431 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
15723 | 17432 |
15724 NOTE: These changes are partly motivated by the various user-interface | 17433 NOTE: These changes are partly motivated by the various user-interface |
15725 changes elsewhere in this document, and partly for Mule support. In | 17434 changes elsewhere in this document, and partly for Mule support. In |
15726 general the various APIs in this document would benefit greatly from | 17435 general the various APIs in this document would benefit greatly from |
15727 built-in keywords. | 17436 built-in keywords. |
15770 @item | 17479 @item |
15771 | 17480 |
15772 The subr object type needs to be modified to contain additional slots | 17481 The subr object type needs to be modified to contain additional slots |
15773 for the number and names of any keyword parameters. | 17482 for the number and names of any keyword parameters. |
15774 @item | 17483 @item |
15775 | |
15776 | 17484 |
15777 The implementation of the @code{funcall} function needs to be modified | 17485 The implementation of the @code{funcall} function needs to be modified |
15778 so that it knows how to process keyword parameters. This is the only | 17486 so that it knows how to process keyword parameters. This is the only |
15779 place that will require very much intricate coding, and much of the | 17487 place that will require very much intricate coding, and much of the |
15780 logic that would need to be added can be lifted directly from the | 17488 logic that would need to be added can be lifted directly from the |
15781 @code{cl} code. | 17489 @code{cl} code. |
15782 @item | 17490 @item |
15783 | |
15784 | 17491 |
15785 A new macro, similar to the @code{DEFUN} macro, and probably called | 17492 A new macro, similar to the @code{DEFUN} macro, and probably called |
15786 @code{DEFUN_WITH_KEYWORDS}, needs to be defined so that built-in Lisp | 17493 @code{DEFUN_WITH_KEYWORDS}, needs to be defined so that built-in Lisp |
15787 primitives containing keywords can be created. Now, the | 17494 primitives containing keywords can be created. Now, the |
15788 @code{DEFUN_WITH_KEYWORDS} macro should take an additional parameter | 17495 @code{DEFUN_WITH_KEYWORDS} macro should take an additional parameter |
15802 that specifies the number of keyword parameters. However, this would | 17509 that specifies the number of keyword parameters. However, this would |
15803 require some additional complexity in the preprocessor definition of the | 17510 require some additional complexity in the preprocessor definition of the |
15804 @code{DEFUN_WITH_KEYWORDS} macro, and probably isn't worth | 17511 @code{DEFUN_WITH_KEYWORDS} macro, and probably isn't worth |
15805 implementing). | 17512 implementing). |
15806 @item | 17513 @item |
15807 | |
15808 | 17514 |
15809 The byte compiler would have to be modified slightly so that it knows | 17515 The byte compiler would have to be modified slightly so that it knows |
15810 about keyword parameters when it parses the parameter declaration of a | 17516 about keyword parameters when it parses the parameter declaration of a |
15811 function, for example so that it issues the correct warnings | 17517 function, for example so that it issues the correct warnings |
15812 concerning calls to that function with incorrect arguments. | 17518 concerning calls to that function with incorrect arguments. |
15813 @item | 17519 @item |
15814 | |
15815 | 17520 |
15816 The @code{make-docfile} program would have to be modified so that it | 17521 The @code{make-docfile} program would have to be modified so that it |
15817 generates the correct parameter lists for primitives defined using the | 17522 generates the correct parameter lists for primitives defined using the |
15818 @code{DEFUN_WITH_KEYWORDS} macro. | 17523 @code{DEFUN_WITH_KEYWORDS} macro. |
15819 @item | 17524 @item |
15820 | |
15821 | 17525 |
15822 Possibly other aspects of the help system that deal with function | 17526 Possibly other aspects of the help system that deal with function |
15823 descriptions might have to be modified. | 17527 descriptions might have to be modified. |
15824 @item | 17528 @item |
15825 | |
15826 | 17529 |
15827 A helper function might need to be defined to make it easier for | 17530 A helper function might need to be defined to make it easier for |
15828 primitives that use both the @code{&rest} and @code{&key} | 17531 primitives that use both the @code{&rest} and @code{&key} |
15829 specifiers to parse their argument lists. | 17532 specifiers to parse their argument lists. |
15830 | 17533 |
15890 @node Future Work -- Property Interface Changes, Future Work -- Toolbars, Future Work -- Keyword Parameters, Future Work | 17593 @node Future Work -- Property Interface Changes, Future Work -- Toolbars, Future Work -- Keyword Parameters, Future Work |
15891 @section Future Work -- Property Interface Changes | 17594 @section Future Work -- Property Interface Changes |
15892 @cindex future work, property interface changes | 17595 @cindex future work, property interface changes |
15893 @cindex property interface changes, future work | 17596 @cindex property interface changes, future work |
15894 | 17597 |
17598 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
17599 | |
15895 In my past work on XEmacs, I already expanded the standard property | 17600 In my past work on XEmacs, I already expanded the standard property |
15896 functions of @code{get}, @code{put}, and @code{remprop} to work on | 17601 functions of @code{get}, @code{put}, and @code{remprop} to work on |
15897 objects other than symbols and defined an additional function | 17602 objects other than symbols and defined an additional function |
15898 @code{object-plist} for this interface. I'd like to expand this | 17603 @code{object-plist} for this interface. I'd like to expand this |
15899 interface further and advertise it as the standard way to make property | 17604 interface further and advertise it as the standard way to make property |
15911 @dfn{unbound}, which is to say that its value has not been explicitly | 17616 @dfn{unbound}, which is to say that its value has not been explicitly |
15912 specified. Note: the way to make a property unbound is to call | 17617 specified. Note: the way to make a property unbound is to call |
15913 @code{remprop}. Note also that for some built-in properties, setting | 17618 @code{remprop}. Note also that for some built-in properties, setting |
15914 the property to its default value is equivalent to making it unbound. | 17619 the property to its default value is equivalent to making it unbound. |
15915 @item | 17620 @item |
15916 | |
15917 | 17621 |
15918 The behavior of the @code{get} function is modified. If the @code{get} | 17622 The behavior of the @code{get} function is modified. If the @code{get} |
15919 function is called on a property that is unbound and the third, optional | 17623 function is called on a property that is unbound and the third, optional |
15920 @var{default} argument is @code{nil}, then the default value of the | 17624 @var{default} argument is @code{nil}, then the default value of the |
15921 property is returned. If the @var{default} argument is not @code{nil}, | 17625 property is returned. If the @var{default} argument is not @code{nil}, |
15925 initial default value of @code{nil}. Code that calls the @code{get} | 17629 initial default value of @code{nil}. Code that calls the @code{get} |
15926 function and specifies @code{nil} for the @var{default} argument, and | 17630 function and specifies @code{nil} for the @var{default} argument, and |
15927 expects to get @code{nil} returned if the property is unbound, is almost | 17631 expects to get @code{nil} returned if the property is unbound, is almost |
15928 certainly wrong anyway. | 17632 certainly wrong anyway. |
15929 @item | 17633 @item |
15930 | |
15931 | 17634 |
15932 A new function, @code{get1}, is defined. This function does not take a | 17635 A new function, @code{get1}, is defined. This function does not take a |
15933 default argument like the @code{get} function. Instead, if the property | 17636 default argument like the @code{get} function. Instead, if the property |
15934 is unbound, an error is signaled. Note: @code{get} can be implemented | 17637 is unbound, an error is signaled. Note: @code{get} can be implemented |
15935 in terms of @code{get1}. | 17638 in terms of @code{get1}. |
15936 @item | 17639 @item |
15937 | |
15938 | 17640 |
15939 New functions @code{property-default-value} and @code{property-bound-p} | 17641 New functions @code{property-default-value} and @code{property-bound-p} |
15940 are defined with the obvious semantics. | 17642 are defined with the obvious semantics. |
15941 @item | 17643 @item |
15942 | |
15943 | 17644 |
15944 An additional function @code{property-built-in-p} is defined which takes | 17645 An additional function @code{property-built-in-p} is defined which takes |
15945 two arguments, the first one being a symbol naming an object type, and | 17646 two arguments, the first one being a symbol naming an object type, and |
15946 the second one specifying a property, and indicates whether the property | 17647 the second one specifying a property, and indicates whether the property |
15947 name has a built-in meaning for objects of that type. | 17648 name has a built-in meaning for objects of that type. |
15948 @item | 17649 @item |
15949 | |
15950 | 17650 |
15951 It is not necessary, or even desirable, for all object types to allow | 17651 It is not necessary, or even desirable, for all object types to allow |
15952 user-defined properties. It is always possible to simulate user-defined | 17652 user-defined properties. It is always possible to simulate user-defined |
15953 properties for an object by using a weak hash table. Therefore, whether | 17653 properties for an object by using a weak hash table. Therefore, whether |
15954 an object allows a user to define properties or not should depend on the | 17654 an object allows a user to define properties or not should depend on the |
15955 meaning of the object. If an object does not allow user-defined | 17655 meaning of the object. If an object does not allow user-defined |
15956 properties, the @code{put} function should signal an error, such as | 17656 properties, the @code{put} function should signal an error, such as |
15957 @code{undefined-property}, when given any property other than those that | 17657 @code{undefined-property}, when given any property other than those that |
15958 are predefined. | 17658 are predefined. |
15959 @item | 17659 @item |
15960 | |
15961 | 17660 |
15962 A function called @code{user-defined-properties-allowed-p} should be | 17661 A function called @code{user-defined-properties-allowed-p} should be |
15963 defined with the obvious semantics. (See the previous item.) | 17662 defined with the obvious semantics. (See the previous item.) |
15964 @item | 17663 @item |
15965 | |
15966 | 17664 |
15967 Three more functions should be defined, called | 17665 Three more functions should be defined, called |
15968 @code{built-in-property-name-list}, @code{property-name-list}, and | 17666 @code{built-in-property-name-list}, @code{property-name-list}, and |
15969 @code{user-defined-property-name-list}. | 17667 @code{user-defined-property-name-list}. |
15970 | 17668 |
15986 | 17684 |
15987 e.g. (define-property-method 'hash-table | 17685 e.g. (define-property-method 'hash-table |
15988 :put #'(lambda (obj key value) (puthash key value obj))) | 17686 :put #'(lambda (obj key value) (puthash key value obj))) |
15989 @end example | 17687 @end example |
15990 | 17688 |
15991 | |
15992 @node Future Work -- Toolbars, Future Work -- Menu API Changes, Future Work -- Property Interface Changes, Future Work | 17689 @node Future Work -- Toolbars, Future Work -- Menu API Changes, Future Work -- Property Interface Changes, Future Work |
15993 @section Future Work -- Toolbars | 17690 @section Future Work -- Toolbars |
15994 @cindex future work, toolbars | 17691 @cindex future work, toolbars |
15995 @cindex toolbars | 17692 @cindex toolbars |
15996 | 17693 |
16001 | 17698 |
16002 @node Future Work -- Easier Toolbar Customization, Future Work -- Toolbar Interface Changes, Future Work -- Toolbars, Future Work -- Toolbars | 17699 @node Future Work -- Easier Toolbar Customization, Future Work -- Toolbar Interface Changes, Future Work -- Toolbars, Future Work -- Toolbars |
16003 @subsection Future Work -- Easier Toolbar Customization | 17700 @subsection Future Work -- Easier Toolbar Customization |
16004 @cindex future work, easier toolbar customization | 17701 @cindex future work, easier toolbar customization |
16005 @cindex easier toolbar customization, future work | 17702 @cindex easier toolbar customization, future work |
17703 | |
17704 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16006 | 17705 |
16007 @strong{Abstract:} One of XEmacs' greatest strengths is its ability to | 17706 @strong{Abstract:} One of XEmacs' greatest strengths is its ability to |
16008 be customized endlessly. Unfortunately, it is often too difficult to | 17707 be customized endlessly. Unfortunately, it is often too difficult to |
16009 figure out how to do this. There has been some recent work like the | 17708 figure out how to do this. There has been some recent work like the |
16010 Custom package, which helps in this regard, but I think there's a lot | 17709 Custom package, which helps in this regard, but I think there's a lot |
16053 ones, would be the ability to change the font size of the captions. I'm | 17752 ones, would be the ability to change the font size of the captions. I'm |
16054 sure that Kyle, for one, would appreciate this. | 17753 sure that Kyle, for one, would appreciate this. |
16055 | 17754 |
16056 (This is incomplete.....) | 17755 (This is incomplete.....) |
16057 | 17756 |
16058 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
16059 | |
16060 @node Future Work -- Toolbar Interface Changes, , Future Work -- Easier Toolbar Customization, Future Work -- Toolbars | 17757 @node Future Work -- Toolbar Interface Changes, , Future Work -- Easier Toolbar Customization, Future Work -- Toolbars |
16061 @subsection Future Work -- Toolbar Interface Changes | 17758 @subsection Future Work -- Toolbar Interface Changes |
16062 @cindex future work, toolbar interface changes | 17759 @cindex future work, toolbar interface changes |
16063 @cindex toolbar interface changes, future work | 17760 @cindex toolbar interface changes, future work |
17761 | |
17762 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16064 | 17763 |
16065 I propose changing the way that toolbars are specified to make them more | 17764 I propose changing the way that toolbars are specified to make them more |
16066 flexible. | 17765 flexible. |
16067 | 17766 |
16068 @enumerate | 17767 @enumerate |
16205 @node Future Work -- Menu API Changes, Future Work -- Removal of Misc-User Event Type, Future Work -- Toolbars, Future Work | 17904 @node Future Work -- Menu API Changes, Future Work -- Removal of Misc-User Event Type, Future Work -- Toolbars, Future Work |
16206 @section Future Work -- Menu API Changes | 17905 @section Future Work -- Menu API Changes |
16207 @cindex future work, menu API changes | 17906 @cindex future work, menu API changes |
16208 @cindex menu API changes, future work | 17907 @cindex menu API changes, future work |
16209 | 17908 |
17909 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16210 | 17910 |
16211 @enumerate | 17911 @enumerate |
16212 @item | 17912 @item |
16213 | 17913 |
16214 I propose making a specifier for the menubar associated with the frame. | 17914 I propose making a specifier for the menubar associated with the frame. |
16258 properties may not actually be implemented at first, but at least the | 17958 properties may not actually be implemented at first, but at least the |
16259 keywords for them should be defined. | 17959 keywords for them should be defined. |
16260 | 17960 |
16261 @end enumerate | 17961 @end enumerate |
16262 | 17962 |
16263 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
16264 | |
16265 @node Future Work -- Removal of Misc-User Event Type, Future Work -- Mouse Pointer, Future Work -- Menu API Changes, Future Work | 17963 @node Future Work -- Removal of Misc-User Event Type, Future Work -- Mouse Pointer, Future Work -- Menu API Changes, Future Work |
16266 @section Future Work -- Removal of Misc-User Event Type | 17964 @section Future Work -- Removal of Misc-User Event Type |
16267 @cindex future work, removal of misc-user event type | 17965 @cindex future work, removal of misc-user event type |
16268 @cindex removal of misc-user event type, future work | 17966 @cindex removal of misc-user event type, future work |
17967 | |
17968 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16269 | 17969 |
16270 @strong{Abstract:} This page describes why the misc-user event type | 17970 @strong{Abstract:} This page describes why the misc-user event type |
16271 should be split up into a number of different event types, and how to do | 17971 should be split up into a number of different event types, and how to do |
16272 this. | 17972 this. |
16273 | 17973 |
16312 @node Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Busy Pointer, Future Work -- Mouse Pointer, Future Work -- Mouse Pointer | 18012 @node Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Busy Pointer, Future Work -- Mouse Pointer, Future Work -- Mouse Pointer |
16313 @subsection Future Work -- Abstracted Mouse Pointer Interface | 18013 @subsection Future Work -- Abstracted Mouse Pointer Interface |
16314 @cindex future work, abstracted mouse pointer interface | 18014 @cindex future work, abstracted mouse pointer interface |
16315 @cindex abstracted mouse pointer interface, future work | 18015 @cindex abstracted mouse pointer interface, future work |
16316 | 18016 |
18017 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18018 | |
16317 @strong{Abstract:} We need to create a new image format that allows | 18019 @strong{Abstract:} We need to create a new image format that allows |
16318 standard pointer shapes to be specified in a way that works on all | 18020 standard pointer shapes to be specified in a way that works on all |
16319 window systems. I suggest that this be called @code{pointer}, which | 18021 window systems. I suggest that this be called @code{pointer}, which |
16320 has one tag associated with it, named @code{:data}, and whose value is a | 18022 has one tag associated with it, named @code{:data}, and whose value is a |
16321 string. The possible strings that can be specified here are predefined | 18023 string. The possible strings that can be specified here are predefined |
16336 be @code{mswindows-resource}. At least in the case of | 18038 be @code{mswindows-resource}. At least in the case of |
16337 @code{cursor-font}, the old value should be maintained for compatibility | 18039 @code{cursor-font}, the old value should be maintained for compatibility |
16338 as an obsolete alias. The @code{resource} format was added so recently | 18040 as an obsolete alias. The @code{resource} format was added so recently |
16339 that it's possible that we can just change it. | 18041 that it's possible that we can just change it. |
16340 | 18042 |
16341 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
16342 | |
16343 @node Future Work -- Busy Pointer, , Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Mouse Pointer | 18043 @node Future Work -- Busy Pointer, , Future Work -- Abstracted Mouse Pointer Interface, Future Work -- Mouse Pointer |
16344 @subsection Future Work -- Busy Pointer | 18044 @subsection Future Work -- Busy Pointer |
16345 @cindex future work, busy pointer | 18045 @cindex future work, busy pointer |
16346 @cindex busy pointer, future work | 18046 @cindex busy pointer, future work |
18047 | |
18048 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16347 | 18049 |
16348 Automatically make the mouse pointer switch to a busy shape (watch | 18050 Automatically make the mouse pointer switch to a busy shape (watch |
16349 signal) when XEmacs has been "busy" for more than, e.g. 2 seconds. | 18051 signal) when XEmacs has been "busy" for more than, e.g. 2 seconds. |
16350 Define the @dfn{busy time} as the time since the last time that XEmacs was | 18052 Define the @dfn{busy time} as the time since the last time that XEmacs was |
16351 ready to receive input from the user. An implementation might be: | 18053 ready to receive input from the user. An implementation might be: |
16393 @node Future Work -- Everything should obey duplicable extents, , Future Work -- Extents, Future Work -- Extents | 18095 @node Future Work -- Everything should obey duplicable extents, , Future Work -- Extents, Future Work -- Extents |
16394 @subsection Future Work -- Everything should obey duplicable extents | 18096 @subsection Future Work -- Everything should obey duplicable extents |
16395 @cindex future work, everything should obey duplicable extents | 18097 @cindex future work, everything should obey duplicable extents |
16396 @cindex everything should obey duplicable extents, future work | 18098 @cindex everything should obey duplicable extents, future work |
16397 | 18099 |
18100 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18101 | |
16398 A lot of functions don't properly track duplicable extents. For | 18102 A lot of functions don't properly track duplicable extents. For |
16399 example, the @code{concat} function does, but the @code{format} function | 18103 example, the @code{concat} function does, but the @code{format} function |
16400 does not, and extents in keymap prompts are not displayed either. All | 18104 does not, and extents in keymap prompts are not displayed either. All |
16401 of the functions that generate strings or string-like entities should | 18105 of the functions that generate strings or string-like entities should |
16402 track the extents that are associated with the strings. Currently this | 18106 track the extents that are associated with the strings. Currently this |
16423 a Lisp string into a @code{lisp_string_struct}. However, there is | 18127 a Lisp string into a @code{lisp_string_struct}. However, there is |
16424 already a function @code{copy_string_extents()} that does basically this | 18128 already a function @code{copy_string_extents()} that does basically this |
16425 exact thing, and it should be easy to create a modified version of this | 18129 exact thing, and it should be easy to create a modified version of this |
16426 function. | 18130 function. |
16427 | 18131 |
16428 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
16429 | |
16430 @node Future Work -- Version Number and Development Tree Organization, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Extents, Future Work | 18132 @node Future Work -- Version Number and Development Tree Organization, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Extents, Future Work |
16431 @section Future Work -- Version Number and Development Tree Organization | 18133 @section Future Work -- Version Number and Development Tree Organization |
16432 @cindex future work, version number and development tree organization | 18134 @cindex future work, version number and development tree organization |
16433 @cindex version number and development tree organization, future work | 18135 @cindex version number and development tree organization, future work |
18136 | |
18137 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16434 | 18138 |
16435 @strong{Abstract:} The purpose of this proposal is to present a coherent | 18139 @strong{Abstract:} The purpose of this proposal is to present a coherent |
16436 plan for how development branches in XEmacs are managed. This will | 18140 plan for how development branches in XEmacs are managed. This will |
16437 cover such issues as stable versus experimental branches, creating new | 18141 cover such issues as stable versus experimental branches, creating new |
16438 branches, synchronizing patches between branches, and how version | 18142 branches, synchronizing patches between branches, and how version |
16724 without the diff getting cluttered up by these code cleanliness changes | 18428 without the diff getting cluttered up by these code cleanliness changes |
16725 that don't change any actual behavior. | 18429 that don't change any actual behavior. |
16726 | 18430 |
16727 @end enumerate | 18431 @end enumerate |
16728 | 18432 |
16729 @uref{../../www.666.com/ben,Ben Wing} | |
16730 | |
16731 @node Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Keybindings, Future Work -- Version Number and Development Tree Organization, Future Work | 18433 @node Future Work -- Improvements to the @code{xemacs.org} Website, Future Work -- Keybindings, Future Work -- Version Number and Development Tree Organization, Future Work |
16732 @section Future Work -- Improvements to the @code{xemacs.org} Website | 18434 @section Future Work -- Improvements to the @code{xemacs.org} Website |
16733 @cindex future work, improvements to the @code{xemacs.org} website | 18435 @cindex future work, improvements to the @code{xemacs.org} website |
16734 @cindex improvements to the @code{xemacs.org} website, future work | 18436 @cindex improvements to the @code{xemacs.org} website, future work |
18437 | |
18438 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16735 | 18439 |
16736 The @code{xemacs.org} web site is the face that XEmacs presents to the | 18440 The @code{xemacs.org} web site is the face that XEmacs presents to the |
16737 outside world. In my opinion, its most important function is to present | 18441 outside world. In my opinion, its most important function is to present |
16738 information about XEmacs in such a way that solicits new XEmacs users | 18442 information about XEmacs in such a way that solicits new XEmacs users |
16739 and co-contributors. Existing members of the XEmacs community can | 18443 and co-contributors. Existing members of the XEmacs community can |
16829 at @uref{../../www.freshmeat.net/default.htm,http://www.freshmeat.net}, | 18533 at @uref{../../www.freshmeat.net/default.htm,http://www.freshmeat.net}, |
16830 the various announcement news groups (for example, | 18534 the various announcement news groups (for example, |
16831 @uref{news:comp.os.linux.announce,comp.os.linux.announce}, and the | 18535 @uref{news:comp.os.linux.announce,comp.os.linux.announce}, and the |
16832 Windows announcement news group) etc. | 18536 Windows announcement news group) etc. |
16833 | 18537 |
16834 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
16835 | |
16836 @node Future Work -- Keybindings, Future Work -- Byte Code Snippets, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work | 18538 @node Future Work -- Keybindings, Future Work -- Byte Code Snippets, Future Work -- Improvements to the @code{xemacs.org} Website, Future Work |
16837 @section Future Work -- Keybindings | 18539 @section Future Work -- Keybindings |
16838 @cindex future work, keybindings | 18540 @cindex future work, keybindings |
16839 @cindex keybindings, future work | 18541 @cindex keybindings, future work |
16840 | 18542 |
16846 | 18548 |
16847 @node Future Work -- Keybinding Schemes, Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings, Future Work -- Keybindings | 18549 @node Future Work -- Keybinding Schemes, Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings, Future Work -- Keybindings |
16848 @subsection Future Work -- Keybinding Schemes | 18550 @subsection Future Work -- Keybinding Schemes |
16849 @cindex future work, keybinding schemes | 18551 @cindex future work, keybinding schemes |
16850 @cindex keybinding schemes, future work | 18552 @cindex keybinding schemes, future work |
18553 | |
18554 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16851 | 18555 |
16852 @strong{Abstract:} We need a standard mechanism that allows different | 18556 @strong{Abstract:} We need a standard mechanism that allows different |
16853 global key binding schemes to be defined. Ideally, this would be the | 18557 global key binding schemes to be defined. Ideally, this would be the |
16854 @uref{keyboard-actions.html,keyboard action interface} that I have | 18558 @uref{keyboard-actions.html,keyboard action interface} that I have |
16855 proposed, however this would require a lot of work on the part of mode | 18559 proposed, however this would require a lot of work on the part of mode |
16864 | 18568 |
16865 @node Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Misc Key Binding Ideas, Future Work -- Keybinding Schemes, Future Work -- Keybindings | 18569 @node Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Misc Key Binding Ideas, Future Work -- Keybinding Schemes, Future Work -- Keybindings |
16866 @subsection Future Work -- Better Support for Windows Style Key Bindings | 18570 @subsection Future Work -- Better Support for Windows Style Key Bindings |
16867 @cindex future work, better support for windows style key bindings | 18571 @cindex future work, better support for windows style key bindings |
16868 @cindex better support for windows style key bindings, future work | 18572 @cindex better support for windows style key bindings, future work |
18573 | |
18574 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16869 | 18575 |
16870 @strong{Abstract:} This page describes how we could create an XEmacs | 18576 @strong{Abstract:} This page describes how we could create an XEmacs |
16871 extension that modifies the global key bindings so that a Windows user | 18577 extension that modifies the global key bindings so that a Windows user |
16872 would feel at home when using the keyboard in XEmacs. Some of these | 18578 would feel at home when using the keyboard in XEmacs. Some of these |
16873 bindings don't conflict with standard XEmacs keybindings and should be | 18579 bindings don't conflict with standard XEmacs keybindings and should be |
16931 allows the user to make a selection of which key binding scheme they | 18637 allows the user to make a selection of which key binding scheme they |
16932 would prefer as the default, either the XEmacs standard bindings, Vi | 18638 would prefer as the default, either the XEmacs standard bindings, Vi |
16933 bindings (which would be Viper mode), Windows-style bindings, Brief, | 18639 bindings (which would be Viper mode), Windows-style bindings, Brief, |
16934 CodeWright, Visual C++, or whatever we manage to implement. | 18640 CodeWright, Visual C++, or whatever we manage to implement. |
16935 | 18641 |
16936 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
16937 | |
16938 @node Future Work -- Misc Key Binding Ideas, , Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings | 18642 @node Future Work -- Misc Key Binding Ideas, , Future Work -- Better Support for Windows Style Key Bindings, Future Work -- Keybindings |
16939 @subsection Future Work -- Misc Key Binding Ideas | 18643 @subsection Future Work -- Misc Key Binding Ideas |
16940 @cindex future work, misc key binding ideas | 18644 @cindex future work, misc key binding ideas |
16941 @cindex misc key binding ideas, future work | 18645 @cindex misc key binding ideas, future work |
18646 | |
18647 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
16942 | 18648 |
16943 @itemize | 18649 @itemize |
16944 @item | 18650 @item |
16945 M-123 ... do digit arg | 18651 M-123 ... do digit arg |
16946 | 18652 |
16998 @node Future Work -- Byte Code Snippets, Future Work -- Lisp Stream API, Future Work -- Keybindings, Future Work | 18704 @node Future Work -- Byte Code Snippets, Future Work -- Lisp Stream API, Future Work -- Keybindings, Future Work |
16999 @section Future Work -- Byte Code Snippets | 18705 @section Future Work -- Byte Code Snippets |
17000 @cindex future work, byte code snippets | 18706 @cindex future work, byte code snippets |
17001 @cindex byte code snippets, future work | 18707 @cindex byte code snippets, future work |
17002 | 18708 |
18709 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18710 | |
17003 @itemize | 18711 @itemize |
17004 @item | 18712 @item |
17005 For use in time critical (e.g. redisplay) places such as display | 18713 For use in time critical (e.g. redisplay) places such as display |
17006 tables - a simple piece of code is evalled, e.g. | 18714 tables - a simple piece of code is evalled, e.g. |
17007 @example | 18715 @example |
17027 @end itemize | 18735 @end itemize |
17028 | 18736 |
17029 @menu | 18737 @menu |
17030 * Future Work -- Autodetection:: | 18738 * Future Work -- Autodetection:: |
17031 * Future Work -- Conversion Error Detection:: | 18739 * Future Work -- Conversion Error Detection:: |
18740 * Future Work -- Unicode:: | |
17032 * Future Work -- BIDI Support:: | 18741 * Future Work -- BIDI Support:: |
17033 * Future Work -- Localized Text/Messages:: | 18742 * Future Work -- Localized Text/Messages:: |
17034 @end menu | 18743 @end menu |
17035 | 18744 |
17036 @node Future Work -- Autodetection, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets, Future Work -- Byte Code Snippets | 18745 @node Future Work -- Autodetection, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets, Future Work -- Byte Code Snippets |
17038 @cindex future work, autodetection | 18747 @cindex future work, autodetection |
17039 @cindex autodetection, future work | 18748 @cindex autodetection, future work |
17040 | 18749 |
17041 There are various proposals contained here. | 18750 There are various proposals contained here. |
17042 | 18751 |
17043 @subsection New Implementation of Autodetection Mechanism | 18752 @subheading New Implementation of Autodetection Mechanism |
18753 | |
18754 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
17044 | 18755 |
17045 The current auto detection mechanism in XEmacs Mule has many | 18756 The current auto detection mechanism in XEmacs Mule has many |
17046 problems. For one thing, it is wrong too much of the time. Another | 18757 problems. For one thing, it is wrong too much of the time. Another |
17047 problem, although easily fixed, is that priority lists are fixed rather | 18758 problem, although easily fixed, is that priority lists are fixed rather |
17048 than varying, depending on the particular locale; and finally, it | 18759 than varying, depending on the particular locale; and finally, it |
17181 As part of the "are you sure" dialog box or question, the user can | 18892 As part of the "are you sure" dialog box or question, the user can |
17182 display the results of the decoding to make sure it's correct. If the | 18893 display the results of the decoding to make sure it's correct. If the |
17183 user says "no, they're not sure," then the same list of choices as | 18894 user says "no, they're not sure," then the same list of choices as |
17184 previously mentioned will be presented. | 18895 previously mentioned will be presented. |
17185 | 18896 |
17186 @subheading Implementation of Coding System Priority Lists in Various Locales | 18897 @subheading RFC: Autodetection |
18898 | |
18899 Also appeared under heading "Implementation of Coding System Priority | |
18900 Lists in Various Locales" ? | |
18901 | |
18902 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull} | |
18903 | |
18904 Date: 11/1/1999 2:48 AM | |
17187 | 18905 |
17188 @example | 18906 @example |
18907 >>>>> "Hrvoje" == Hrvoje Niksic <hniksic@@srce.hr> writes: | |
18908 | |
18909 [Ben sez:] | |
18910 | |
18911 >> You are perfectly free to set up your XEmacs like this, but | |
18912 >> XEmacs/Mule @strong{will} autodetect by default if there is no | |
18913 >> Content-Type: info and no reason to believe we are dealing with | |
18914 >> binary files. | |
18915 | |
18916 Hrvoje> In that case, it will be a serious mistake to make | |
18917 Hrvoje> --with-mule the default, ever. I think more care should | |
18918 Hrvoje> be shown in meeting the need of European users. | |
18919 @end example | |
18920 | |
18921 Hrvoje, I don't understand what you are worrying about. I suspect you | |
18922 are worrying about Handa's hyperactive and obstinate Mule, not what | |
18923 Ben has in mind. Yes, Ben has said "better guessing," but that's | |
18924 simply not reasonable without substantial language environment | |
18925 information. I think trying to detect Latin-1 vs Latin-2 in the POSIX | |
18926 locale would be a big mistake, I think trying to guess Big 5 v. Shift | |
18927 JIS in a European locale would be a big mistake. | |
18928 | |
18929 If Ben doesn't mean "more appropriate use of language environment | |
18930 information" when he writes "better guessing," I, as much as you, want | |
18931 to see how he plans to do that. Ben? ("Yes/no/oops I need to think | |
18932 about it" is good enough if you have specifics you intend to put in | |
18933 the RFC you're planning to present.) | |
18934 | |
18935 Let me give a formal proposal of what I would like to see in the | |
18936 autodetection specification. | |
18937 | |
17189 @enumerate | 18938 @enumerate |
18939 @item | |
18940 Definitions | |
18941 | |
18942 @enumerate | |
18943 @item | |
18944 @dfn{Autodetection} means detecting and making available to Mule | |
18945 the external file's encoding. See (5), below. It doesn't | |
18946 imply any specific actions based on that information. | |
18947 | |
18948 @item | |
18949 The @dfn{default} case is POSIX locale, and no environment | |
18950 information in ~/.emacs. | |
18951 | |
18952 N.B. This @strong{will} cause breakage for all 1-byte users because | |
18953 the default case can no longer assume Latin-1. You @strong{may} be | |
18954 able to use the TTY font or the Xt -font option to fake this, | |
18955 and default to iso8859-1; I would hope that we would not use | |
18956 such a kludge in the beta versions, although it might be | |
18957 satisfactory for general use. In particular, encodings like | |
18958 VISCII (Vietnamese) and I believe KOI-8 (Cyrillic) are not | |
18959 ISO-2022-clean, but using C1 control characters as a heuristic | |
18960 for detecting binary files is useful. | |
18961 | |
18962 If we do allow it, I think that XEmacs should bitch and warn | |
18963 that the practice of implicitly specifying language | |
18964 environment by -font and defaulting on TTYs is deprecated and | |
18965 likely to be obsoleted. | |
18966 | |
18967 @item | |
18968 The @dfn{European} case is any Latin-* locale, either implied by | |
18969 setlocale() and friends or set in ~/.emacs. Latin-1 is | |
18970 specifically not given precedence over other Latin-*, or | |
18971 non-Latin or non-ISO-8859 for that matter. I suspect but am | |
18972 not sure that this case extends to all ISO-8859 encodings, and | |
18973 possibly to non-ISO-8859 single-byte encodings like KOI-8r (in | |
18974 particular when combined in a class with ISO-8859 encodings). | |
18975 | |
18976 @item | |
18977 The @dfn{CJK} case is any CJK locale. Japanese is specifically | |
18978 not given precedence over other Asian locales. | |
18979 | |
18980 @item | |
18981 For completeness, define the @dfn{Unicode} case (Unicode | |
18982 unfortunately has lots of junk such as precomposed characters, | |
18983 language tags, and directionality indicators in it; we | |
18984 probably don't care yet, but we should also not claim | |
18985 compliance) and the @dfn{general} case (which has a lot of | |
18986 features similar to Unicode, but lacks the advantage of a | |
18987 unified encoding). This proposal has no idea how to handle | |
18988 the special features of these, or even if that matters. The | |
18989 general case includes stuff that nobody here really knows how | |
18990 it works, like Tibetan and Ethiopic. | |
18991 @end enumerate | |
18992 | |
18993 Each of the following cases is given in the order of priority of | |
18994 detection. I'm not sure I'm serious about the top priority given the | |
18995 (optional) Unicode detection. This may be appropriate if Ben is | |
18996 right that ISO-2022 is going to disappear, but possibly not until then | |
18997 (two two-byte sequences out of 65536 is probably 1.99 too many). It | |
18998 probably isn't too risky if (6)(c) is taken pretty seriously; a Unicode | |
18999 file should contain _no_ private use characters unless the encoding is | |
19000 explicitly specified, and that's a block of 1/10 of the code space, | |
19001 which should help a lot in detecting binary files. | |
19002 | |
17190 @item | 19003 @item |
17191 Default locale | 19004 Default locale |
17192 | 19005 |
17193 @enumerate | 19006 @enumerate |
17194 @item | 19007 @item |
17291 Newlines will be detected in text files. | 19104 Newlines will be detected in text files. |
17292 @end enumerate | 19105 @end enumerate |
17293 | 19106 |
17294 @item | 19107 @item |
17295 Unicode and general locales; multilingual use | 19108 Unicode and general locales; multilingual use |
17296 @end enumerate | |
17297 | 19109 |
17298 @enumerate | 19110 @enumerate |
17299 @item | 19111 @item |
17300 Hopefully a system general enough to handle (2)--(4) will | 19112 Hopefully a system general enough to handle (2)--(4) will |
17301 handle these, too, but we should watch out for gotchas like | 19113 handle these, too, but we should watch out for gotchas like |
17311 would involve (eg) heuristics like picking a set of code | 19123 would involve (eg) heuristics like picking a set of code |
17312 points that are frequent in Shift JIS and uncommon in Big 5 | 19124 points that are frequent in Shift JIS and uncommon in Big 5 |
17313 and betting that a file containing many characters from that | 19125 and betting that a file containing many characters from that |
17314 set is Shift JIS. | 19126 set is Shift JIS. |
17315 @end enumerate | 19127 @end enumerate |
17316 @end example | 19128 |
19129 @item | |
19130 Relationship to decoding semantics | |
19131 | |
19132 @enumerate | |
19133 @item | |
19134 Autodetection should be run on every input stream unless the | |
19135 user explicitly disables it. | |
19136 | |
19137 @item | |
19138 The (conceptual) default procedure is | |
19139 | |
19140 @item | |
19141 Read the file into the buffer | |
19142 | |
19143 Announce the result of autodetection to the user. | |
19144 | |
19145 User may request decoding, with autodetected encoding(s) | |
19146 given priority in a list of available encodings. | |
19147 | |
19148 Optimizations (see (e) below) should avoid introducing data | |
19149 corruption that this default procedure would avoid. | |
19150 | |
19151 Obviously, it can't be perfect if any autodecoding is done; | |
19152 users like Hrvoje should have an easily available option to | |
19153 revert to this default (or an optimized approximation which | |
19154 doesn't actually read the whole file into a buffer) or simply | |
19155 display everything as binary (with the "font" for binary files | |
19156 being a user option). | |
19157 | |
19158 @item | |
19159 This implies that we should be detecting conditions in the | |
19160 tail of the file which violate the implicit assumptions of the | |
19161 coding system autodetected (eg, in UTF-8 illegal UTF-8 | |
19162 sequences, including those corresponding to surrogates) should | |
19163 raise a warning; the buffer should probably be made read-only | |
19164 and the user prompted. | |
19165 | |
19166 This could be taken to extremes, like checking by table | |
19167 whether all characters in a Japanese file are actually | |
19168 legitimate JIS codes; that's insane (and would cause corporate | |
19169 encodings to be recognized as binary). But we should think | |
19170 about the idea that autodetection shouldn't mean XEmacs can't | |
19171 change its mind. | |
19172 | |
19173 @item | |
19174 A flexible means for the user to delegate the decision | |
19175 (conditional on the result of autodetection) to decode or not | |
19176 to XEmacs or a Lisp program should be provided (eg, the | |
19177 coding priority list and/or a file-coding-alist). | |
19178 | |
19179 @item | |
19180 Optimized operations (eg, the current lstreams) should be | |
19181 provided, with the recognition that if they depend on sampling | |
19182 the file they are risky. | |
19183 | |
19184 @item | |
19185 Mule should provide a reasonable set of default delegations | |
19186 (as in (d) above) for as many locales as possible. | |
19187 @end enumerate | |
19188 | |
19189 @item | |
19190 Implementation | |
19191 | |
19192 @enumerate | |
19193 @item | |
19194 I think all the decision logic suggested above can be | |
19195 accomplished through a coding-priority-list and appropriate | |
19196 initializations for different language environments, and a | |
19197 file-coding-alist. | |
19198 | |
19199 @item | |
19200 Many of the tests on the file's tail shouldn't be very | |
19201 expensive; in particular, all of the ones I've suggested are | |
19202 O(n) although they might involve moderate-sized auxiliary | |
19203 tables for efficiency (eg, 64kB for a single Unicode-oriented | |
19204 test). | |
19205 @end enumerate | |
19206 @end enumerate | |
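The tail check suggested above (illegal UTF-8 sequences, including those corresponding to surrogates, should raise a warning) can be sketched in a few lines. This is an illustration in Python, not XEmacs code: the name @code{first_invalid_utf8} is hypothetical, and the sketch assumes Python's strict UTF-8 decoder, which rejects encoded surrogates and overlong forms, is an acceptable stand-in for a hand-written O(n) scanner.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first ill-formed UTF-8 sequence, or None.

    Hypothetical helper: it relies on Python's strict decoder, which treats
    encoded surrogates (U+D800..U+DFFF) and overlong forms as errors.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte.
        return err.start
    return None
```

A caller that gets a non-None offset would presumably make the buffer read-only and prompt the user, as the proposal suggests.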
19207 | |
19208 Other comments: | |
19209 | |
19210 It might be reasonable given Hrvoje's objections to require that any | |
19211 autodetection that could cause data loss (any coding system that | |
19212 involves escape sequences, and only those AFAIK: by design translation | |
19213 to Unicode is invertible) by default prompt the user (presumably with | |
19214 a novice-like ability to retain the prompt, always default to binary, | |
19215 or always default to the autodetected encoding) in the future, at | |
19216 least in locales that don't need it (POSIX, Latin-any). | |
19217 | |
19218 Ben thinks that we can remember the input data; I think it's going to | |
19219 be hard to comprehensively test that a highly optimized version works. | |
19220 Good design will help, but ISO-2022 is enormously complex, and there | |
19221 are many encodings that violate even its lax assumptions. On the | |
19222 other hand, memory is the only way to get non-rewindable streams right. | |
19223 | |
19224 Hrvoje himself said he would like to have an XEmacs that distinguishes | |
19225 between Latin-1 and Latin-2 text. Where it is possible to do that, | |
19226 this is exactly what autodetection of ISO-2022 and Unicode gives you. | |
19227 Many people would want that, even at some risk of binary corruption. | |
19228 | |
19229 >> Once again I remind you that XEmacs is a @strong{text} editor. There | |
19230 >> are lots of files that potentially may have Japanese etc. in | |
19231 >> them without this marked, e.g. C or Elisp files in the XEmacs | |
19232 >> source. Surely you're not arguing that we interpret even these | |
19233 >> files as binary by default? | |
19234 | |
19235 Hrvoje> I am. If I want to see Japanese, I'll setup my | |
19236 Hrvoje> environment that way. But I don't, and neither do 99% of | |
19237 Hrvoje> Croatian users. I can't speak for French, Italian, and | |
19238 Hrvoje> others, but I'd assume similar. | |
19239 | |
19240 Hrvoje> If there is Japanese in the source files, I will see it as | |
19241 Hrvoje> escape sequences, which is perfectly fine, because I don't | |
19242 Hrvoje> read Japanese. | |
19243 | |
19244 And some (European) people will have their terminals scrambled, | |
19245 because Shift-JIS contains sequences that can change the state of | |
19246 XTerm (as do fixed-width Unicode and Big5). This may also be a | |
19247 problem with some Windows-12xx encodings; I'm not sure they all are | |
19248 ISO-2022-clean. (This isn't a problem for XEmacs native X11 frames or | |
19249 native MS-Windows frames, and the XEmacs sources themselves are all in | |
19250 7-bit ISO-2022 now IIRC. But it is a potential source of great | |
19251 frustration for many users.) | |
19252 | |
19253 I think that should be considered too, although it is presumably lower | |
19254 priority than the data corruption of binary files. | |
19255 | |
19256 @subheading Response to RFC: Autodetection | |
19257 | |
19258 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
19259 | |
19260 Date: 11/1/1999 7:24 AM | |
19261 | |
19262 Stephen, thank you very much for writing this up. I think it is a good start, | |
19263 and definitely moving in the direction I would like to see things going: more | |
19264 proposals, less arguing. (aka "more light, less heat") However, I have some | |
19265 suggestions for cleaning this up: | |
19266 | |
19267 You should try to make it more layered. For example, you might have one | |
19268 section devoted to the workings of autodetection, which starts out like this | |
19269 (the section numbers below are totally arbitrary): | |
19270 | |
19271 @subsubheading Section 5 | |
19272 | |
19273 @code{Autodetect()} is a function whose arguments are (1) a readable stream, (2) some | |
19274 hints indicating how the autodetection is to proceed, and (3) a value | |
19275 indicating the maximum number of characters to examine at the beginning of the | |
19276 stream. (Possibly, the value in (3) may be some special symbol indicating | |
19277 that we only go as far as the next line, or a certain number of lines ahead; | |
19278 this would be used as part of "continuous autodetection", e.g. we are decoding | |
19279 the results of an interactive terminal session, where the user may | |
19280 periodically switch encodings, line terminations, etc. as different programs | |
19281 get run and/or telnet or similar sessions are entered into and exited.) We | |
19282 assume the stream is rewindable; if not, insert a "rewinding" stream in front | |
19283 of the non-rewinding stream; this kind of stream automatically buffers the | |
19284 data as necessary. | |
19285 [You can use pseudo-code terminology here. No need for straight C or ELisp.] | |
19286 [Then proceed to describe what the hints look like -- e.g. you could portray | |
19287 it as a property list or whatever. The idea is that, for each locale, there | |
19288 is a corresponding hints value that is used at least by default. The hints | |
19289 structure also has to be set up to allow for two or more competing hints | |
19290 specifications to be merged together. For example, the extension of a file | |
19291 might provide an additional hint or hints about how to interpret the data of | |
19292 that file, and the caller of @code{autodetect()}, when calling @code{autodetect()} on such a | |
19293 file, would need to have a way of gracefully merging the default hints | |
19294 corresponding to the locale with the more specific hints provided by the | |
19295 extension. Furthermore, users like Hrvoje might well want to provide their | |
19296 own hints to supplement and override parts of the generic hints -- e.g. "I | |
19297 don't ever want to see non-European encodings decoded; treat them as binary | |
19298 instead".] | |
19299 [Then describe algorithmically how the autodetection works. First, you could | |
19300 describe it more generally, i.e. presenting an algorithmic overview, then you | |
19301 could discuss in detail exactly how autodetection of a particular type of | |
19302 external encoding works -- e.g. "for iso2022, we first look for an escape | |
19303 character, followed by a byte in this range [. ... .] etc."] | |
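
The shape of @code{autodetect()} described in Section 5 (a readable stream, a hints value, and a limit on how much to examine, with a "rewinding" stream placed in front of a non-rewindable one) might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the class name @code{RewindingStream}, the @code{"priority"} hints key, the use of a byte limit rather than a character limit, and the toy detection rule (a NUL byte means binary, otherwise the first encoding in the priority list that decodes the sample wins).

```python
import codecs
import io


class RewindingStream:
    """Wrap a non-seekable stream, buffering what is read so it can be replayed."""

    def __init__(self, raw):
        self._raw = raw
        self._buffer = bytearray()
        self._pos = 0

    def read(self, n):
        # Pull more bytes from the underlying stream only when the buffer runs short.
        missing = self._pos + n - len(self._buffer)
        if missing > 0:
            self._buffer += self._raw.read(missing)
        chunk = bytes(self._buffer[self._pos:self._pos + n])
        self._pos += len(chunk)
        return chunk

    def rewind(self):
        self._pos = 0


def autodetect(stream, hints, max_bytes):
    """Toy detector: examine at most max_bytes, then rewind the stream."""
    sample = stream.read(max_bytes)
    stream.rewind()
    if b"\x00" in sample:
        return "binary"
    for name in hints.get("priority", []):
        try:
            # final=False tolerates a multibyte sequence cut off by the limit.
            codecs.getincrementaldecoder(name)().decode(sample, final=False)
        except (UnicodeDecodeError, LookupError):
            continue
        return name
    return "binary"


# Usage: detection consumes nothing from the caller's point of view.
demo = RewindingStream(io.BytesIO("h\u00e9llo".encode("utf-8")))
detected = autodetect(demo, {"priority": ["ascii", "utf-8"]}, 3)
```

Because the wrapper replays its buffer after @code{rewind()}, the decoder that runs after detection sees the stream from the start, which is the property the rewinding stream is meant to provide.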
19304 | |
19305 @subsubheading Section 6 | |
19306 | |
19307 This section describes the concept of a locale in XEmacs, and how it is | |
19308 derived from the user's environment. A locale in XEmacs is a pair, a country | |
19309 and a language, together determining the handling of locale-specific areas of | |
19310 XEmacs. All locale-specific areas in XEmacs make use of this XEmacs locale, | |
19311 and do not attempt to derive the locale from any other sources. The user is | |
19312 free to change the current locale at any time; accessor and mutator functions | |
19313 are provided to do this so that various locale-specific areas can optionally | |
19314 be changed together with it. | |
19315 | |
19316 [Then you describe how the XEmacs locale is extracted from .emacs, from | |
19317 @code{setlocale()}, from the LANG environment variables, from -font, or wherever | |
19318 else. All other sections assume this dirty work is done and never even | |
19319 mention it] | |
19320 | |
19321 @subsubheading Section 7 | |
19322 | |
19323 [Here you describe the default @code{autodetect()} hints value corresponding to each | |
19324 possible locale. You should probably use a schematic description here, e.g. | |
19325 an actual Lisp property list, liberally commented.] | |
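
As a rough illustration of the hints value and of the merging of competing hints discussed in Section 5, the following Python sketch layers locale defaults, file-extension hints, and user overrides. The key names and the values are hypothetical and chosen only for the example; a real design might use a Lisp property list and a finer-grained merge than simple key-by-key override.

```python
def merge_hints(*layers):
    """Merge hint dicts; later layers override earlier ones key by key."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical values, for illustration only.
LOCALE_HINTS = {"priority": ["iso-8859-2", "utf-8"], "decode_non_european": True}
EXTENSION_HINTS = {"priority": ["utf-8"]}        # e.g. derived from a file extension
USER_HINTS = {"decode_non_european": False}      # "treat them as binary instead"

effective = merge_hints(LOCALE_HINTS, EXTENSION_HINTS, USER_HINTS)
```

Ordering the layers from most general to most specific lets a user-level setting win over both the locale default and the extension hint without modifying either.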
19326 | |
19327 @subsubheading Section 8 etc. | |
19328 | |
19329 [Other sections cover anything I've missed. By being very careful to separate | |
19330 out the layers, you simultaneously introduce more rigor (easier to catch bugs) | |
19331 and make it easier for someone else to understand it completely.] | |
17317 | 19332 |
17318 @subheading Better Algorithm, More Flexibility, Different Levels of Certainty | 19333 @subheading Better Algorithm, More Flexibility, Different Levels of Certainty |
17319 | 19334 |
17320 @subheading Much More Flexible Coding System Priority List, per-Language Environment | 19335 @subheading Much More Flexible Coding System Priority List, per-Language Environment |
17321 | 19336 |
17322 @subheading User Ability to Select Encoding when System Unsure or Encounters Errors | 19337 @subheading User Ability to Select Encoding when System Unsure or Encounters Errors |
17323 | 19338 |
17324 @subheading Another Autodetection Proposal | 19339 @subheading Another Autodetection Proposal |
19340 | |
19341 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
17325 | 19342 |
17326 however, in general the detection code has major problems and needs lots | 19343 however, in general the detection code has major problems and needs lots |
17327 of work: | 19344 of work: |
17328 | 19345 |
17329 @itemize @bullet | 19346 @itemize @bullet |
17330 @item | 19347 @item |
17331 instead of merely "yes" or "no" for particular categories, we need a | 19348 instead of merely "yes" or "no" for particular categories, we need a |
17332 more flexible system, with various levels of likelihood. Currently | 19349 more flexible system, with various levels of likelihood. Currently |
17333 I've created a system with six levels, as follows: | 19350 I've created a system with six levels, as follows: |
17334 | 19351 |
17335 [see file-coding.h] | 19352 [see @file{file-coding.h}] |
17336 | 19353 |
17337 Let's consider what this might mean for an ASCII text detector. (In | 19354 Let's consider what this might mean for an ASCII text detector. (In |
17338 order to have accurate detection, especially given the iteration I | 19355 order to have accurate detection, especially given the iteration I |
17339 proposed below, we need active detectors for @strong{all} types of data we | 19356 proposed below, we need active detectors for @strong{all} types of data we |
17340 might reasonably encounter, such as ASCII text files, binary files, | 19357 might reasonably encounter, such as ASCII text files, binary files, |
17499 | 19516 |
17500 ben [at least that's what sjt thinks] | 19517 ben [at least that's what sjt thinks] |
17501 | 19518 |
17502 ***** | 19519 ***** |
17503 | 19520 |
19521 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull} | |
19522 | |
17504 While this is clearly something of an improvement over earlier designs, | 19523 While this is clearly something of an improvement over earlier designs, |
17505 it doesn't deal with the most important issue: to do better than categories | 19524 it doesn't deal with the most important issue: to do better than categories |
17506 (which in the medium term is mostly going to mean "which flavor of Unicode | 19525 (which in the medium term is mostly going to mean "which flavor of Unicode |
17507 is this?"), we need to look at statistical behavior rather than ruling out | 19526 is this?"), we need to look at statistical behavior rather than ruling out |
17508 categories via presence of specific sequences. This means the stream | 19527 categories via presence of specific sequences. This means the stream |
17525 and "magic" like Unicode signatures or file(1) magic. | 19544 and "magic" like Unicode signatures or file(1) magic. |
17526 @end enumerate | 19545 @end enumerate |
17527 | 19546 |
17528 --sjt | 19547 --sjt |
17529 | 19548 |
17530 @node Future Work -- Conversion Error Detection, Future Work -- BIDI Support, Future Work -- Autodetection, Future Work -- Byte Code Snippets | 19549 @node Future Work -- Conversion Error Detection, Future Work -- Unicode, Future Work -- Autodetection, Future Work -- Byte Code Snippets |
17531 @subsection Future Work -- Conversion Error Detection | 19550 @subsection Future Work -- Conversion Error Detection |
17532 @cindex future work, conversion error detection | 19551 @cindex future work, conversion error detection |
17533 @cindex conversion error detection, future work | 19552 @cindex conversion error detection, future work |
17534 | 19553 |
17535 @subheading "No Corruption" Scheme for Preserving External Encoding when Non-Invertible Transformation Applied | 19554 @subheading "No Corruption" Scheme for Preserving External Encoding when Non-Invertible Transformation Applied |
19555 | |
19556 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
17536 | 19557 |
17537 A preliminary and simple implementation is: | 19558 A preliminary and simple implementation is: |
17538 | 19559 |
17539 @quotation | 19560 @quotation |
17540 But you could implement it much more simply and usefully by just | 19561 But you could implement it much more simply and usefully by just |
17599 correspondences to get the internal state right. | 19620 correspondences to get the internal state right. |
17600 @end enumerate | 19621 @end enumerate |
17601 @end quotation | 19622 @end quotation |
17602 | 19623 |
17603 @subheading Another Error-Catching Idea | 19624 @subheading Another Error-Catching Idea |
19625 | |
19626 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
17604 | 19627 |
17605 Nov 4, 1999 | 19628 Nov 4, 1999 |
17606 | 19629 |
17607 Finally, I don't think "save the input" is as hard as you make it out to | 19630 Finally, I don't think "save the input" is as hard as you make it out to |
17608 be. Conceptually, in fact, it's simple: for each minimal group of bytes | 19631 be. Conceptually, in fact, it's simple: for each minimal group of bytes |
17619 cases. The hardest part, in fact, is making all the string/text | 19642 cases. The hardest part, in fact, is making all the string/text |
17620 handling in XEmacs be robust w.r.t. text properties. | 19643 handling in XEmacs be robust w.r.t. text properties. |
17621 | 19644 |
17622 @subheading Strategies for Error Annotation and Coding Orthogonalization | 19645 @subheading Strategies for Error Annotation and Coding Orthogonalization |
17623 | 19646 |
17624 From sjt (?): | 19647 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull} |
17625 | 19648 |
17626 We really want to separate out a number of things. Conceptually, | 19649 We really want to separate out a number of things. Conceptually, |
17627 there is a nested syntax. | 19650 there is a nested syntax. |
17628 | 19651 |
17629 At the top level is the ISO 2022 extension syntax, including charset | 19652 At the top level is the ISO 2022 extension syntax, including charset |
17660 It's possible that, by doing the processing with tables of functions or | 19683 It's possible that, by doing the processing with tables of functions or |
17661 the like, the parser can be used for both detection and translation. | 19684 the like, the parser can be used for both detection and translation. |
17662 | 19685 |
17663 @subheading Handling Writing a File Safely, Without Data Loss | 19686 @subheading Handling Writing a File Safely, Without Data Loss |
17664 | 19687 |
17665 From ben: | 19688 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} |
17666 | 19689 |
17667 @quotation | 19690 @quotation |
17668 When writing a file, we need error detection; otherwise somebody | 19691 When writing a file, we need error detection; otherwise somebody |
17669 will create a Unicode file without realizing the coding system | 19692 will create a Unicode file without realizing the coding system |
17670 of the buffer is Raw, and then lose all the non-ASCII/Latin-1 | 19693 of the buffer is Raw, and then lose all the non-ASCII/Latin-1 |
17715 same thing (error checking, list of alternatives, etc.) needs | 19738 same thing (error checking, list of alternatives, etc.) needs |
17716 to happen when reading! all of this will be a lot of work! | 19739 to happen when reading! all of this will be a lot of work! |
17717 @end enumerate | 19740 @end enumerate |
17718 @end quotation | 19741 @end quotation |
17719 | 19742 |
17720 --ben | 19743 Author: @uref{mailto:stephen@@xemacs.org,Stephen Turnbull} |
17721 | 19744 |
17722 I don't much like Ben's scheme. First, this isn't an issue of I/O, | 19745 I don't much like Ben's scheme. First, this isn't an issue of I/O, |
17723 it's a coding issue. It can happen in many places, not just on stream | 19746 it's a coding issue. It can happen in many places, not just on stream |
17724 I/O. Error checking should take place on all translations. Second, | 19747 I/O. Error checking should take place on all translations. Second, |
17725 the two-pass algorithm should be avoided if possible. In some cases | 19748 the two-pass algorithm should be avoided if possible. In some cases |
17747 characters. So (up to some maximum) we should keep a list of unsafe | 19770 characters. So (up to some maximum) we should keep a list of unsafe |
17748 text positions, and provide a convenient function for traversing them. | 19771 text positions, and provide a convenient function for traversing them. |
17749 | 19772 |
17750 --sjt | 19773 --sjt |
17751 | 19774 |
17752 @node Future Work -- BIDI Support, Future Work -- Localized Text/Messages, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets | 19775 @node Future Work -- Unicode, Future Work -- BIDI Support, Future Work -- Conversion Error Detection, Future Work -- Byte Code Snippets |
19776 @subsection Future Work -- Unicode | |
19777 @cindex future work, unicode | |
19778 @cindex unicode, future work | |
19779 | |
19780 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
19781 | |
19782 Following is an old proposal. Unicode has been implemented already, in | |
19783 a different fashion; but there are some ideas here for more general | |
19784 support, e.g. properties of Unicode characters other than their mappings | |
19785 to particular charsets. | |
19786 | |
19787 | |
19788 We recognize 128, [256], 128x128, [256x256] for source charsets; | |
19789 | |
19790 for Unicode, 256x256 or 16x256x256. | |
19791 | |
19792 In all cases, use tables of tables and substitute a default subtable | |
19793 if an entire row is empty. | |
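The "tables of tables" layout with a shared default subtable can be sketched roughly as follows. This is only an illustration of the idea, not the XEmacs implementation: the names `to_unicode_table`, `table_init`, `table_set`, `table_get` and the `UNMAPPED` marker are all hypothetical, and the sketch assumes the 128x128 source case with 16-bit Unicode destinations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ROW_SIZE 128
#define UNMAPPED 0xFFFFu  /* hypothetical "no mapping" marker */

/* One shared row stands in for every row that has no entries. */
static uint16_t default_row[ROW_SIZE];
static int default_row_ready = 0;

/* Hypothetical two-level table: 128 row pointers, 128 entries per row. */
typedef struct {
  uint16_t *rows[ROW_SIZE];  /* default_row, or a privately allocated row */
} to_unicode_table;

void table_init(to_unicode_table *t)
{
  int i;
  if (!default_row_ready) {
    for (i = 0; i < ROW_SIZE; i++)
      default_row[i] = UNMAPPED;
    default_row_ready = 1;
  }
  /* Every row starts out empty, so every row shares the default. */
  for (i = 0; i < ROW_SIZE; i++)
    t->rows[i] = default_row;
}

/* Returns 0 on success, -1 if a private row could not be allocated. */
int table_set(to_unicode_table *t, int hi, int lo, uint16_t ucs)
{
  assert(hi >= 0 && hi < ROW_SIZE && lo >= 0 && lo < ROW_SIZE);
  if (t->rows[hi] == default_row) {
    /* First entry in this row: give it its own storage. */
    uint16_t *row = malloc(ROW_SIZE * sizeof *row);
    int i;
    if (!row)
      return -1;
    for (i = 0; i < ROW_SIZE; i++)
      row[i] = UNMAPPED;
    t->rows[hi] = row;
  }
  t->rows[hi][lo] = ucs;
  return 0;
}

uint16_t table_get(const to_unicode_table *t, int hi, int lo)
{
  assert(hi >= 0 && hi < ROW_SIZE && lo >= 0 && lo < ROW_SIZE);
  return t->rows[hi][lo];
}
```

Lookups never need to test for a missing row, because an empty row still points at valid storage; the cost is one extra pointer comparison on the first store into each row.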
19794 | |
19795 If destination is Unicode, either 16 or 32 bits. | |
19796 | |
19797 If destination is charset, either 8 or 16 bits. | |
19798 | |
19799 For the moment, since we only do 94, 96, 94x94 or 96x96, only do 128 | |
19800 or 128x128 for source charsets and use the range 33-126 or 32-127. | |
19801 (Except ASCII - we special case that and have no table because we can | |
19802 algorithmically translate) | |
19803 | |
19804 Also have a 16x256x256 table -> 32 bits of Unicode char properties. | |
19805 | |
19806 A particular charset contains two associated mapping tables, for both | |
19807 directions. | |
19808 | |
19809 API is set-unicode-mapping: | |
19810 | |
19811 @example | |
19812 (set-unicode-mapping | |
19813 unicode char | |
19814 unicode charset-code charset-offset | |
19815 unicode vector of char | |
19816 unicode list of char | |
19817 unicode string of char | |
19818 unicode vector or list of codes charset-offset | |
19819 @end example | |
19820 | |
19821 Establishes a mapping between a unicode codepoint (an integer) and | |
19822 one or more chars in a charset. The mapping is automatically | |
19823 established in both directions. Chars in a charset can be specified | |
19824 either with an actual character or a codepoint (i.e. an integer) | |
19825 and the charset it's within. If a sequence of chars or charset | |
19826 points is given, multiple mappings are established for consecutive | |
19827 unicode codepoints starting with the given one. Charset codepoints | |
19828 are specified as most-significant x 256 + least significant, with | |
19829 both bytes in the range 33-126 (for 94 or 94x94) or 32-127 (for 96 | |
19830 or 96x96), unless an offset is given, which will be subtracted from | |
19831 each byte. (Most common values are 128, for codepoints given with | |
19832 the high bit set, or -32, for codepoints given as 1-94 or 0-95.) | |
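As a rough sketch of the codepoint convention just described, the following helper splits a two-byte charset code into its bytes, subtracts the offset from each, and checks the result against the valid range. The function name `normalize_code_2d` and its signature are hypothetical, and the sketch covers only the two-dimensional (94x94 or 96x96) case.

```c
#include <assert.h>

/* Hypothetical helper: split CODE (most-significant x 256 + least
   significant) into two bytes, subtract OFFSET from each, and verify
   that both land in 33-126 (94x94) or 32-127 (96x96).
   Returns 0 and stores the bytes on success, -1 if out of range. */
int normalize_code_2d(int code, int offset, int is_96, int *hi, int *lo)
{
  int min = is_96 ? 32 : 33;
  int max = is_96 ? 127 : 126;
  int h = ((code >> 8) & 0xFF) - offset;
  int l = (code & 0xFF) - offset;

  assert(hi && lo);
  if (h < min || h > max || l < min || l > max)
    return -1;
  *hi = h;
  *lo = l;
  return 0;
}
```

With this reading, a code given with the high bit set (0xB0A1, offset 128) and the same position given as row 16, column 1 (16 x 256 + 1, offset -32) both normalize to the bytes 0x30 0x21.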
19833 | |
19834 Other APIs: | |
19835 | |
19836 @example | |
19837 (write-unicode-mapping file charset) | |
19838 @end example | |
19839 | |
19840 Write the mapping table for a particular charset to the specified | |
19841 file. The tables are written in an internal format that allows for | |
19842 efficient loading, for portability across platforms and XEmacs | |
19843 invocations, for conserving space, for appending multiple tables one | |
19844 directly after another with no need for a directory anywhere in the | |
19845 file, and for recognizing a file as being in this format (with a magic | |
19846 sequence at the beginning). The data will be appended at the end of | |
19847 a file, so that multiple tables can be written to a file; remove the | |
19848 file first to avoid this. | |
19849 | |
19850 @example | |
19851 (write-unicode-properties file unicode-codepoint length) | |
19852 @end example | |
19853 | |
19854 Write the Unicode properties (not including charset mappings) for | |
19855 the specified range of contiguous Unicode codepoints to the end of | |
19856 the file (i.e. append mode) in a binary format similar to what was | |
19857 mentioned in the write-unicode-mapping description and with the same | |
19858 features. | |
19859 | |
19860 Extension to set-unicode-mapping: | |
19861 | |
19862 @example | |
19863 (set-unicode-mapping | |
19864 list-or-vector-of-unicode-codepoints char | |
19865 "" charset-code charset-offset | |
19866 "" sequence of char | |
19867 "" list-or-vector-of-codes | |
19868 charset-offset | |
19869 @end example | |
19870 | |
19871 The first two forms are conceptually the inverse of the forms above | |
19872 to specify characters for a contiguous range of Unicode codepoints. | |
19873 These new forms let you specify the Unicode codepoints for a | |
19874 contiguous range of chars in a charset. "Contiguous" here means | |
19875 that if we run off the end of a row, we go to the first entry of the | |
19876 next row, rather than to an invalid code point. For example, in a | |
19877 94x94 charset, valid rows and columns are in the range 0x21-0x7e; | |
19878 after 0x457c 0x457d 0x457e goes 0x4621, not something like 0x457f, | |
19879 which is invalid. | |
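The "contiguous" stepping rule for a 94x94 charset can be sketched as a successor function. The name `next_code_94x94` is hypothetical; the sketch assumes codes are given as row x 256 + column with both bytes in 0x21-0x7e.

```c
#include <assert.h>

/* Hypothetical helper: return the code that follows CODE in a 94x94
   charset, wrapping from column 0x7e to column 0x21 of the next row.
   Returns -1 when CODE is the last position in the charset. */
int next_code_94x94(int code)
{
  int hi = code >> 8;
  int lo = code & 0xFF;

  assert(hi >= 0x21 && hi <= 0x7E && lo >= 0x21 && lo <= 0x7E);
  if (lo < 0x7E)
    return (hi << 8) | (lo + 1);     /* stay in the same row */
  if (hi < 0x7E)
    return ((hi + 1) << 8) | 0x21;   /* first entry of the next row */
  return -1;                         /* ran off the end of the charset */
}
```

Assigning a list of Unicode codepoints to a contiguous range would then amount to applying this step once per list element, starting from the given char.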
19880 | |
19881 The final two forms are the most general, letting you specify an | |
19882 arbitrary set of both Unicode points and charset chars, and the two | |
19883 are matched up just like a series of individual calls. However, if | |
19884 the lists or vectors do not have the same length, an error is | |
19885 signaled. | |
19886 | |
19887 @example | |
19888 (load-unicode-mapping file &optional charset) | |
19889 @end example | |
19890 | |
19891 If charset is omitted, loads all charset mapping tables found and | |
19892 returns a list of the charsets found. If charset is specified, | |
19893 searches through the file for the appropriate mapping tables. (This | |
19894 is extremely fast because each entry in the file gives an offset to | |
19895 the next one). Returns t if found. | |
19896 | |
19897 @example | |
19898 (load-unicode-properties file unicode-codepoint) | |
19899 @end example | |
19900 | |
19901 @example | |
19902 (list-unicode-entries file) | |
19903 @end example | |
19904 | |
19905 @example | |
19906 (autoload-unicode-mapping charset) | |
19907 @end example | |
19908 | |
19909 ... | |
19910 | |
19911 (unfinished) | |
19912 | |
19913 @node Future Work -- BIDI Support, Future Work -- Localized Text/Messages, Future Work -- Unicode, Future Work -- Byte Code Snippets | |
17753 @subsection Future Work -- BIDI Support | 19914 @subsection Future Work -- BIDI Support |
17754 @cindex future work, bidi support | 19915 @cindex future work, bidi support |
17755 @cindex bidi support, future work | 19916 @cindex bidi support, future work |
19917 | |
19918 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
17756 | 19919 |
17757 @enumerate | 19920 @enumerate |
17758 @item | 19921 @item |
17759 Use text properties to handle nesting levels, overrides | 19922 Use text properties to handle nesting levels, overrides |
17760 BIDI-specific text properties (as per Unicode BIDI algorithm) | 19923 BIDI-specific text properties (as per Unicode BIDI algorithm) |
17812 | 19975 |
17813 (much of this comment is outdated, and a lot of it is actually | 19976 (much of this comment is outdated, and a lot of it is actually |
17814 implemented) | 19977 implemented) |
17815 | 19978 |
17816 @subsection Proposal for How This All Ought to Work | 19979 @subsection Proposal for How This All Ought to Work |
19980 | |
19981 Author: @uref{mailto:jwz@@jwz.org,Jamie Zawinski} | |
17817 | 19982 |
17818 this isn't implemented yet, but this is the plan-in-progress | 19983 this isn't implemented yet, but this is the plan-in-progress |
17819 | 19984 |
17820 In general, it's accepted that the best way to internationalize is for all | 19985 In general, it's accepted that the best way to internationalize is for all |
17821 messages to be referred to by a symbolic name (or number) and come out of a | 19986 messages to be referred to by a symbolic name (or number) and come out of a |
17860 one we know how to translate, then we translate it? I think this is a | 20025 one we know how to translate, then we translate it? I think this is a |
17861 worthy goal. It remains to be seen how well it will work in practice. | 20026 worthy goal. It remains to be seen how well it will work in practice. |
17862 | 20027 |
17863 So, we should endeavor to minimize the impact on the lisp code. Certain | 20028 So, we should endeavor to minimize the impact on the lisp code. Certain |
17864 primitive lisp routines (the stuff in lisp/prim/, and especially in | 20029 primitive lisp routines (the stuff in lisp/prim/, and especially in |
17865 cmdloop.el and minibuf.el) may need to be changed to know about translation, | 20030 @file{cmdloop.el} and @file{minibuf.el}) may need to be changed to know about translation, |
17866 but that's an ideologically clean thing to do because those are considered | 20031 but that's an ideologically clean thing to do because those are considered |
17867 a part of the emacs substrate. | 20032 a part of the emacs substrate. |
17868 | 20033 |
17869 However, if we find ourselves wanting to make changes to, say, RMAIL, then | 20034 However, if we find ourselves wanting to make changes to, say, RMAIL, then |
17870 something has gone wrong. (Except to do things like remove assumptions | 20035 something has gone wrong. (Except to do things like remove assumptions |
17880 the translation. The new plan is to separate these two things more: the | 20045 the translation. The new plan is to separate these two things more: the |
17881 tags that we search for to build the catalog will be stuff that was in there | 20046 tags that we search for to build the catalog will be stuff that was in there |
17882 already, and the translation will get done in some more centralized, lower | 20047 already, and the translation will get done in some more centralized, lower |
17883 level place. | 20048 level place. |
17884 | 20049 |
17885 This program (make-msgfile.c) addresses the first part, extracting the | 20050 This program (@file{make-msgfile.c}) addresses the first part, extracting the |
17886 strings. | 20051 strings. |
17887 | 20052 |
17888 For the emacs C code, we need to recognize the following patterns: | 20053 For the emacs C code, we need to recognize the following patterns: |
17889 | 20054 |
17890 @example | 20055 @example |
17929 | 20094 |
17930 I expect there will be a lot like the above; basically, any function which | 20095 I expect there will be a lot like the above; basically, any function which |
17931 is a commonly used wrapper around an eventual call to @code{message} or | 20096 is a commonly used wrapper around an eventual call to @code{message} or |
17932 @code{read-from-minibuffer} needs to be recognized by this program. | 20097 @code{read-from-minibuffer} needs to be recognized by this program. |
17933 | 20098 |
17934 | |
17935 @example | 20099 @example |
17936 (dgettext "domain-name" "string") #### do we still need this? | 20100 (dgettext "domain-name" "string") #### do we still need this? |
17937 | 20101 |
17938 things that should probably be restructured: | 20102 things that should probably be restructured: |
17939 @code{princ} in cmdloop.el | 20103 @code{princ} in @file{cmdloop.el} |
17940 @code{insert} in debug.el | 20104 @code{insert} in @file{debug.el} |
17941 face-interactive | 20105 face-interactive |
17942 help.el, syntax.el all messed up | 20106 @file{help.el}, @file{syntax.el} all messed up |
17943 @end example | 20107 @end example |
17944 | 20108 |
20109 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
20110 | |
17945 ben: (format) is a tricky case. If I use format to create a string | 20111 ben: (format) is a tricky case. If I use format to create a string |
17946 that I then send to a file, I probably don't want the string translated. | 20112 that I then send to a file, I probably don't want the string translated. |
17947 On the other hand, if the string gets used as an argument to (y-or-n-p) | 20113 On the other hand, if the string gets used as an argument to (y-or-n-p) |
17948 or some such function, I do want it translated, and it needs to be | 20114 or some such function, I do want it translated, and it needs to be |
17949 translated before the %s and such are replaced. The proper solution | 20115 translated before the %s and such are replaced. The proper solution |
18051 We can solve this by adding a bit to Lisp_String objects which identifies | 20217 We can solve this by adding a bit to Lisp_String objects which identifies |
18052 them as having been read as literal constants from a .el or .elc file (as | 20218 them as having been read as literal constants from a .el or .elc file (as |
18053 opposed to having been constructed at run time as it would in the above | 20219 opposed to having been constructed at run time as it would in the above |
18054 case.) To solve this: | 20220 case.) To solve this: |
18055 | 20221 |
18056 @example | 20222 @itemize @bullet |
18057 - @code{Fmessage()} takes a lisp string as its first argument. | 20223 @item |
18058 If that string is a constant, that is, was read from a source file | 20224 @code{Fmessage()} takes a lisp string as its first argument. |
18059 as a literal, then it calls @code{message()} with it, which translates. | 20225 If that string is a constant, that is, was read from a source file |
18060 Otherwise, it calls @code{message_no_translate()}, which does not translate. | 20226 as a literal, then it calls @code{message()} with it, which translates. |
18061 | 20227 Otherwise, it calls @code{message_no_translate()}, which does not translate. |
18062 - @code{Ferror()} (actually, @code{Fsignal()} when condition is Qerror) works similarly. | 20228 |
18063 @end example | 20229 @item |
20230 @code{Ferror()} (actually, @code{Fsignal()} when condition is Qerror) works similarly. | |
20231 @end itemize | |
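The dispatch on the proposed "read as a literal" bit might look something like the sketch below. Everything here is a stand-in: `struct sketch_string`, `sketch_translate` and `sketch_message_text` are hypothetical names, the one-entry catalog is a toy, and the real `Lisp_String` layout and message functions are not shown.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a Lisp string carrying the proposed bit. */
struct sketch_string {
  const char *data;
  int read_as_literal;  /* 1 if read as a constant from a .el or .elc file */
};

/* Toy catalog lookup; a real build would consult the message catalog. */
const char *sketch_translate(const char *msg)
{
  if (strcmp(msg, "Quit") == 0)
    return "Quitter";
  return msg;  /* no catalog entry: fall back to the original text */
}

/* Pick the text to display: literals go through translation, strings
   built at run time are passed through untouched. */
const char *sketch_message_text(const struct sketch_string *s)
{
  assert(s && s->data);
  return s->read_as_literal ? sketch_translate(s->data) : s->data;
}
```

The point of the bit is that two strings with identical contents can be treated differently: one that came from source text is a candidate for translation, while one assembled at run time is not.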
18064 | 20232 |
18065 More specifically, we do: | 20233 More specifically, we do: |
18066 | 20234 |
18067 @quotation | 20235 @quotation |
18068 Scan specified C and Lisp files, extracting the following messages: | 20236 Scan specified C and Lisp files, extracting the following messages: |
18100 it might run into problems if Arg is used for other sorts | 20268 it might run into problems if Arg is used for other sorts |
18101 of functions. | 20269 of functions. |
18102 @item | 20270 @item |
18103 @code{snarf()} should be modified so that it doesn't output null | 20271 @code{snarf()} should be modified so that it doesn't output null |
18104 strings and non-textual strings (see the comment at the top | 20272 strings and non-textual strings (see the comment at the top |
18105 of make-msgfile.c). | 20273 of @file{make-msgfile.c}). |
18106 @item | 20274 @item |
18107 parsing of (insert) should snarf all of the arguments. | 20275 parsing of (insert) should snarf all of the arguments. |
18108 @item | 20276 @item |
18109 need to add set-keymap-prompt and deal with gettext of that. | 20277 need to add set-keymap-prompt and deal with gettext of that. |
18110 @item | 20278 @item |
18139 | 20307 |
18140 @node Future Work -- Lisp Stream API, Future Work -- Multiple Values, Future Work -- Byte Code Snippets, Future Work | 20308 @node Future Work -- Lisp Stream API, Future Work -- Multiple Values, Future Work -- Byte Code Snippets, Future Work |
18141 @section Future Work -- Lisp Stream API | 20309 @section Future Work -- Lisp Stream API |
18142 @cindex future work, Lisp stream API | 20310 @cindex future work, Lisp stream API |
18143 @cindex Lisp stream API, future work | 20311 @cindex Lisp stream API, future work |
20312 | |
20313 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18144 | 20314 |
18145 Expose XEmacs internal lstreams to Lisp as stream objects. (In | 20315 Expose XEmacs internal lstreams to Lisp as stream objects. (In |
18146 addition to the functions given below, each stream object has | 20316 addition to the functions given below, each stream object has |
18147 properties that can be associated with it using the standard put, get | 20317 properties that can be associated with it using the standard put, get |
18148 etc. API. For GNU Emacs, where put and get have not been extended to | 20318 etc. API. For GNU Emacs, where put and get have not been extended to |
18530 @node Future Work -- Multiple Values, Future Work -- Macros, Future Work -- Lisp Stream API, Future Work | 20700 @node Future Work -- Multiple Values, Future Work -- Macros, Future Work -- Lisp Stream API, Future Work |
18531 @section Future Work -- Multiple Values | 20701 @section Future Work -- Multiple Values |
18532 @cindex future work, multiple values | 20702 @cindex future work, multiple values |
18533 @cindex multiple values, future work | 20703 @cindex multiple values, future work |
18534 | 20704 |
20705 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
20706 | |
18535 On low level, all funs that can return multiple values are defined | 20707 On low level, all funs that can return multiple values are defined |
18536 with DEFUN_MULTIPLE_VALUES and have an extra parameter, a struct | 20708 with DEFUN_MULTIPLE_VALUES and have an extra parameter, a struct |
18537 mv_context *. | 20709 mv_context *. |
18538 | 20710 |
18539 It has to be this way to ensure that only the fun itself, and no called | 20711 It has to be this way to ensure that only the fun itself, and no called |
18574 | 20746 |
18575 @node Future Work -- Macros, Future Work -- Specifiers, Future Work -- Multiple Values, Future Work | 20747 @node Future Work -- Macros, Future Work -- Specifiers, Future Work -- Multiple Values, Future Work |
18576 @section Future Work -- Macros | 20748 @section Future Work -- Macros |
18577 @cindex future work, macros | 20749 @cindex future work, macros |
18578 @cindex macros, future work | 20750 @cindex macros, future work |
20751 | |
20752 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18579 | 20753 |
18580 @enumerate | 20754 @enumerate |
18581 @item | 20755 @item |
18582 Option to control whether beep really kills a macro execution. | 20756 Option to control whether beep really kills a macro execution. |
18583 @item | 20757 @item |
18592 | 20766 |
18593 @node Future Work -- Specifiers, Future Work -- Display Tables, Future Work -- Macros, Future Work | 20767 @node Future Work -- Specifiers, Future Work -- Display Tables, Future Work -- Macros, Future Work |
18594 @section Future Work -- Specifiers | 20768 @section Future Work -- Specifiers |
18595 @cindex future work, specifiers | 20769 @cindex future work, specifiers |
18596 @cindex specifiers, future work | 20770 @cindex specifiers, future work |
20771 | |
20772 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18597 | 20773 |
18598 @subheading Ideas To Work On When Their Time Has Come | 20774 @subheading Ideas To Work On When Their Time Has Come |
18599 | 20775 |
18600 @itemize | 20776 @itemize |
18601 @item | 20777 @item |
18799 @node Future Work -- Display Tables, Future Work -- Making Elisp Function Calls Faster, Future Work -- Specifiers, Future Work | 20975 @node Future Work -- Display Tables, Future Work -- Making Elisp Function Calls Faster, Future Work -- Specifiers, Future Work |
18800 @section Future Work -- Display Tables | 20976 @section Future Work -- Display Tables |
18801 @cindex future work, display tables | 20977 @cindex future work, display tables |
18802 @cindex display tables, future work | 20978 @cindex display tables, future work |
18803 | 20979 |
20980 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
20981 | |
18804 #### It would also be really nice if you could specify that the | 20982 #### It would also be really nice if you could specify that the |
18805 characters come out in hex instead of in octal. Mule does that by | 20983 characters come out in hex instead of in octal. Mule does that by |
18806 adding a @code{ctl-hexa} variable similar to @code{ctl-arrow}, but | 20984 adding a @code{ctl-hexa} variable similar to @code{ctl-arrow}, but |
18807 that's bogus -- we need a more general solution. I think you need to | 20985 that's bogus -- we need a more general solution. I think you need to |
18808 extend the concept of display tables into a more general conversion | 20986 extend the concept of display tables into a more general conversion |
18841 @end example | 21019 @end example |
18842 | 21020 |
18843 Since more than one display table is possible, you have | 21021 Since more than one display table is possible, you have |
18844 great flexibility in mapping ranges of characters. | 21022 great flexibility in mapping ranges of characters. |
18845 | 21023 |
18846 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
18847 | |
18848 @node Future Work -- Making Elisp Function Calls Faster, Future Work -- Lisp Engine Replacement, Future Work -- Display Tables, Future Work | 21024 @node Future Work -- Making Elisp Function Calls Faster, Future Work -- Lisp Engine Replacement, Future Work -- Display Tables, Future Work |
18849 @section Future Work -- Making Elisp Function Calls Faster | 21025 @section Future Work -- Making Elisp Function Calls Faster |
18850 @cindex future work, making Elisp function calls faster | 21026 @cindex future work, making Elisp function calls faster |
18851 @cindex making Elisp function calls faster, future work | 21027 @cindex making Elisp function calls faster, future work |
21028 | |
21029 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
18852 | 21030 |
18853 @strong{Abstract: }This page describes many optimizations that can be | 21031 @strong{Abstract: }This page describes many optimizations that can be |
18854 made to the existing Elisp function call mechanism without too much | 21032 made to the existing Elisp function call mechanism without too much |
18855 effort. The most important optimizations can probably be implemented | 21033 effort. The most important optimizations can probably be implemented |
18856 with only a day or two of work. I think it's important to do this work | 21034 with only a day or two of work. I think it's important to do this work |
18949 | 21127 |
18950 Calling @code{Fset()} to change the variable's value. | 21128 Calling @code{Fset()} to change the variable's value. |
18951 | 21129 |
18952 @end enumerate | 21130 @end enumerate |
18953 | 21131 |
18954 | |
18955 @end enumerate | 21132 @end enumerate |
18956 | |
18957 | |
18958 | 21133 |
18959 The entire series of calls to @code{specbind()} should be inline and | 21134 The entire series of calls to @code{specbind()} should be inline and |
18960 merged into the argument processing code as a single tight loop, with no | 21135 merged into the argument processing code as a single tight loop, with no |
18961 function calls in the vast majority of cases. The @code{specbind()} | 21136 function calls in the vast majority of cases. The @code{specbind()} |
18962 logic should be streamlined as follows: | 21137 logic should be streamlined as follows: |
18996 issue here is with symbols whose names begin with a colon. These | 21171 issue here is with symbols whose names begin with a colon. These |
18997 symbols should simply be disallowed completely as parameter names.) | 21172 symbols should simply be disallowed completely as parameter names.) |
18998 | 21173 |
18999 @end enumerate | 21174 @end enumerate |
19000 | 21175 |
19001 | |
19002 @end enumerate | 21176 @end enumerate |
19003 | |
19004 | |
19005 | 21177 |
19006 Other optimizations that could be done are: | 21178 Other optimizations that could be done are: |
19007 | 21179 |
19008 @itemize | 21180 @itemize |
19009 @item | 21181 @item |
19083 true and is false. (Note: the optimization detailed in this item is | 21255 true and is false. (Note: the optimization detailed in this item is |
19084 probably not worth doing on the first pass.) | 21256 probably not worth doing on the first pass.) |
19085 | 21257 |
19086 @end itemize | 21258 @end itemize |
19087 | 21259 |
19088 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
19089 | |
19090 @node Future Work -- Lisp Engine Replacement, , Future Work -- Making Elisp Function Calls Faster, Future Work | 21260 @node Future Work -- Lisp Engine Replacement, , Future Work -- Making Elisp Function Calls Faster, Future Work |
19091 @section Future Work -- Lisp Engine Replacement | 21261 @section Future Work -- Lisp Engine Replacement |
19092 @cindex future work, lisp engine replacement | 21262 @cindex future work, lisp engine replacement |
19093 @cindex lisp engine replacement, future work | 21263 @cindex lisp engine replacement, future work |
19094 | 21264 |
19095 @menu | 21265 @menu |
19096 * Future Work -- Lisp Engine Discussion:: | 21266 * Future Work -- Lisp Engine Discussion:: |
19097 * Future Work -- Lisp Engine Replacement -- Implementation:: | 21267 * Future Work -- Lisp Engine Replacement -- Implementation:: |
21268 * Future Work -- Startup File Modification by Packages:: | |
19098 @end menu | 21269 @end menu |
19099 | 21270 |
19100 @node Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Lisp Engine Replacement, Future Work -- Lisp Engine Replacement | 21271 @node Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Lisp Engine Replacement, Future Work -- Lisp Engine Replacement |
19101 @subsection Future Work -- Lisp Engine Discussion | 21272 @subsection Future Work -- Lisp Engine Discussion |
19102 @cindex future work, lisp engine discussion | 21273 @cindex future work, lisp engine discussion |
19103 @cindex lisp engine discussion, future work | 21274 @cindex lisp engine discussion, future work |
19104 | 21275 |
21276 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
19105 | 21277 |
19106 @strong{Abstract: }Recently there has been a great deal of talk on the | 21278 @strong{Abstract: }Recently there has been a great deal of talk on the |
19107 XEmacs mailing lists about potential changes to the XEmacs Lisp engine. | 21279 XEmacs mailing lists about potential changes to the XEmacs Lisp engine. |
19108 Usually the discussion has centered around the question which is better, | 21280 Usually the discussion has centered around the question which is better, |
19109 Common Lisp or Scheme? This is certainly an interesting debate topic, | 21281 Common Lisp or Scheme? This is certainly an interesting debate topic, |
19223 to make this safe would be to do conservative garbage collection over | 21395 to make this safe would be to do conservative garbage collection over |
19224 the C stack and to eliminate the GCPRO declarations entirely. But how | 21396 the C stack and to eliminate the GCPRO declarations entirely. But how |
19225 many of the Lisp engines that are being considered have such a mechanism | 21397 many of the Lisp engines that are being considered have such a mechanism |
19226 built into them? | 21398 built into them? |
19227 | 21399 |
19228 | |
19229 @subsubheading Maintainability. | 21400 @subsubheading Maintainability. |
19230 | 21401 |
19231 A new Lisp engine might well improve the maintainability of XEmacs by | 21402 A new Lisp engine might well improve the maintainability of XEmacs by |
19232 offloading the maintenance of the Lisp engine. However, we need to make | 21403 offloading the maintenance of the Lisp engine. However, we need to make |
19233 very sure that this is, in fact, the case before embarking on a project | 21404 very sure that this is, in fact, the case before embarking on a project |
19282 naturally in an object-oriented system. However, neither Scheme nor | 21453 naturally in an object-oriented system. However, neither Scheme nor |
19283 Common Lisp has been designed with object orientation in mind. There is | 21454 Common Lisp has been designed with object orientation in mind. There is |
19284 a standard object system for Common Lisp, but it is extremely complex | 21455 a standard object system for Common Lisp, but it is extremely complex |
19285 and difficult to understand. | 21456 and difficult to understand. |
19286 | 21457 |
19287 | 21458 @node Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Startup File Modification by Packages, Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement |
19288 @uref{../../www.666.com/ben/default.htm,Ben Wing} | |
19289 | |
19290 | |
19291 @node Future Work -- Lisp Engine Replacement -- Implementation, , Future Work -- Lisp Engine Discussion, Future Work -- Lisp Engine Replacement | |
19292 @subsection Future Work -- Lisp Engine Replacement -- Implementation | 21459 @subsection Future Work -- Lisp Engine Replacement -- Implementation |
19293 @cindex future work, lisp engine replacement, implementation | 21460 @cindex future work, lisp engine replacement, implementation |
19294 @cindex lisp engine replacement, implementation, future work | 21461 @cindex lisp engine replacement, implementation, future work |
21462 | |
21463 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
19295 | 21464 |
19296 Let's take a look at the sort of work that would be required if we were | 21465 Let's take a look at the sort of work that would be required if we were |
19297 to replace the existing Elisp engine in XEmacs with some other engine, | 21466 to replace the existing Elisp engine in XEmacs with some other engine, |
19298 for example, the Clisp engine. I'm assuming here, of course, that we | 21467 for example, the Clisp engine. I'm assuming here, of course, that we |
19299 are not going to be changing the interface here at the same time, which | 21468 are not going to be changing the interface here at the same time, which |
19431 something special needs to happen when this is done. This could be | 21600 something special needs to happen when this is done. This could be |
19432 handled fairly easily by having our new and improved @code{DEFUN} macro | 21601 handled fairly easily by having our new and improved @code{DEFUN} macro |
19433 define a new macro for use when calling a primitive. | 21602 define a new macro for use when calling a primitive. |
19434 @end enumerate | 21603 @end enumerate |
19435 | 21604 |
19436 | |
19437 @subsubheading Make the Existing Lisp Engine be Self-contained. | 21605 @subsubheading Make the Existing Lisp Engine be Self-contained. |
19438 | 21606 |
19439 The goal of this stage is to gradually build up a self-contained Lisp | 21607 The goal of this stage is to gradually build up a self-contained Lisp |
19440 engine out of the existing XEmacs core, which has no dependencies on any | 21608 engine out of the existing XEmacs core, which has no dependencies on any |
19441 of the code elsewhere in the XEmacs core, and has a well-defined and | 21609 of the code elsewhere in the XEmacs core, and has a well-defined and |
19639 again on the old and buggy interfaced Lisp engine, it would note the | 21807 again on the old and buggy interfaced Lisp engine, it would note the |
19640 bug. | 21808 bug. |
19641 | 21809 |
19642 @end enumerate | 21810 @end enumerate |
19643 | 21811 |
19644 | 21812 @node Future Work -- Startup File Modification by Packages, , Future Work -- Lisp Engine Replacement -- Implementation, Future Work -- Lisp Engine Replacement |
19645 @uref{../../www.666.com/ben/default.htm,Ben Wing} | 21813 @subsection Future Work -- Startup File Modification by Packages |
21814 @cindex future work, startup file modification by packages | |
21815 @cindex startup file modification by packages, future work | |
21816 | |
21817 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
21818 | |
21819 OK, we need to create a design document for all of this, including: | |
21820 | |
21821 PRINCIPLE #1: Whenever you have auto-generated stuff, @strong{CLEARLY} | |
21822 indicate this in comments around the stuff. These comments get | |
21823 searched for, and used to locate the existing generated stuff to | |
21824 replace. Custom currently doesn't do this. | |
21825 | |
21826 PRINCIPLE #2: Currently, lots of functions want to add code to the | |
21827 .emacs. (e.g. I get prompted for my mail address from | |
21828 add-change-log-entry, and then prompted if I want to make this | |
21829 permanent). There needs to be a Lisp API for working with arbitrary | |
21830 code to be added to a user's startup. This API hides all the details | |
21831 of which file to put the fragment in, where in it, how to mark it with | |
21832 magical comments of the right kind so that previous fragments can be | |
21833 replaced, etc. | |
21834 | |
21835 PRINCIPLE #3: @strong{ALL} generated stuff should be loaded before any | |
21836 user-written init stuff. This way the user can override the generated | |
21837 settings. Although in the case of customize, it may work when the | |
21838 custom stuff is at the end of the init file, it surely won't work for | |
21839 arbitrary code fragments (which typically do @code{setq} or the like). | |
21840 | |
21841 PRINCIPLE #4: As much as possible, generated stuff should be placed in | |
21842 separate files from non-generated stuff. Otherwise it's inevitable | |
21843 that some corruption is going to result. | |
21844 | |
21845 PRINCIPLE #5: Packages are encouraged, as much as possible, to work | |
21846 within the customize model and store all their customizations there. | |
21847 However, if they really need to have their own init files, these files | |
21848 should be placed in .xemacs/, given normal names | |
21849 (e.g. @file{saved-abbrevs.el} not .abbrevs), and there should be some magic | |
21850 comment at the top of the file that causes it to get automatically | |
21851 loaded while loading a user's init file. (Alternatively, the | |
21852 above-named API could specify a function that lets a package specify | |
21853 that they want such-and-such file loaded from the init file, and have | |
21854 the specifics of this get handled correctly.) | |
21855 | |
21856 OVERARCHING GOAL: The overarching goal is to provide a unified | |
21857 mechanism for packages to store state and setting information about | |
21858 the user and what they were doing when XEmacs exited, so that the same | |
21859 or a similar environment can be automatically set up the next time. | |
21860 In general, we are working more and more towards being a truly GUI app | |
21861 where users' settings are easy to change and get remembered correctly | |
21862 and consistently from one session to the next, rather than requiring | |
21863 nasty hacking in elisp. | |
21864 | |
21865 Hrvoje, do you have any interest in this? How about you, Martin? | |
21866 This seems like it might be up your alley. This stuff has been | |
21867 ad-hocked since kingdom come, and it's high time that we make this | |
21868 work properly so that it could be relied upon, and a lot of things | |
21869 could "just work". | |
19646 | 21870 |
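Principles #1 to #3 above can be made concrete with a small sketch. It is written in Python only to show the text handling; the API the text proposes would be a Lisp one. The marker comment format and the name `replace_fragment` are assumptions made for illustration and do not exist in XEmacs or Custom. The sketch brackets each generated fragment with searchable marker comments, replaces a previous fragment with the same identifier in place, and otherwise puts the new fragment ahead of the user-written code so the user can still override it.

```python
import re

# Hypothetical marker comments; the real format would be chosen by the API.
BEGIN = ";;; BEGIN GENERATED: {id} (do not edit by hand)"
END = ";;; END GENERATED: {id}"


def replace_fragment(init_text: str, fragment_id: str, code: str) -> str:
    """Return init_text with the generated fragment for fragment_id
    replaced in place, or added before all user-written code."""
    begin = BEGIN.format(id=fragment_id)
    end = END.format(id=fragment_id)
    block = f"{begin}\n{code.rstrip()}\n{end}\n"

    # Principle #1: the markers are what we search for to find the old fragment.
    pattern = re.compile(
        re.escape(begin) + r"\n.*?" + re.escape(end) + r"\n?",
        re.DOTALL,
    )
    if pattern.search(init_text):
        # Use a function as the replacement so that backslashes in the
        # generated code are not interpreted as regex escapes.
        return pattern.sub(lambda _match: block, init_text, count=1)

    # Principle #3: generated code is loaded before user-written code.
    return block + init_text
```

A caller such as the change-log example in Principle #2 would pass its own identifier and the form to store, and would not need to know which file the fragment ends up in or how it is marked.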
19647 @node Future Work Discussion, Old Future Work, Future Work, Top | 21871 @node Future Work Discussion, Old Future Work, Future Work, Top |
19648 @chapter Future Work Discussion | 21872 @chapter Future Work Discussion |
19649 @cindex future work, discussion | 21873 @cindex future work, discussion |
19650 @cindex discussion, future work | 21874 @cindex discussion, future work |
19655 into the normal Future Work section. | 21879 into the normal Future Work section. |
19656 | 21880 |
19657 @menu | 21881 @menu |
19658 * Discussion -- garbage collection:: | 21882 * Discussion -- garbage collection:: |
19659 * Discussion -- glyphs:: | 21883 * Discussion -- glyphs:: |
21884 * Discussion -- Dialog Boxes:: | |
21885 * Discussion -- Multilingual Issues:: | |
21886 * Discussion -- Windows External Widget:: | |
21887 * Discussion -- Packages:: | |
21888 * Discussion -- Distribution Layout:: | |
19660 @end menu | 21889 @end menu |
19661 | 21890 |
19662 @node Discussion -- garbage collection, Discussion -- glyphs, Future Work Discussion, Future Work Discussion | 21891 @node Discussion -- garbage collection, Discussion -- glyphs, Future Work Discussion, Future Work Discussion |
19663 @section Discussion -- garbage collection | 21892 @section Discussion -- garbage collection |
19664 @cindex discussion, garbage collection | 21893 @cindex discussion, garbage collection |
19665 @cindex garbage collection, discussion | 21894 @cindex garbage collection, discussion |
19666 | 21895 |
19667 | |
19668 @example | |
19669 On Tue, Oct 12, 1999 at 03:36:59AM -0700, Ben Wing wrote: | 21896 On Tue, Oct 12, 1999 at 03:36:59AM -0700, Ben Wing wrote: |
19670 @end example | |
19671 | 21897 |
19672 So what am I missing here? | 21898 So what am I missing here? |
19673 | 21899 |
19674 @example | |
19675 In response, Olivier Galibert wrote: | 21900 In response, Olivier Galibert wrote: |
19676 @end example | |
19677 | 21901 |
19678 Two things: | 21902 Two things: |
19679 @enumerate | 21903 @enumerate |
19680 @item | 21904 @item |
19681 The purespace is gone | 21905 The purespace is gone |
19710 was used. | 21934 was used. |
19711 @item | 21935 @item |
19712 move the markbit outside of the lrecord. | 21936 move the markbit outside of the lrecord. |
19713 @end itemize | 21937 @end itemize |
19714 | 21938 |
19715 | |
19716 The second solution is more appealing to me for a bunch of reasons: | 21939 The second solution is more appealing to me for a bunch of reasons: |
19717 @itemize @bullet | 21940 @itemize @bullet |
19718 @item | 21941 @item |
19719 more things are shared than only what is purecopied (not yet used | 21942 more things are shared than only what is purecopied (not yet used |
19720 functions come to mind) | 21943 functions come to mind) |
19743 So no, it's not a _necessity_. But it helps. And the automatic | 21966 So no, it's not a _necessity_. But it helps. And the automatic |
19744 sharing of all objects until you write to them explicitly is, I | 21967 sharing of all objects until you write to them explicitly is, I |
19745 think, really cool. | 21968 think, really cool. |
19746 @end enumerate | 21969 @end enumerate |
19747 | 21970 |
19748 | |
19749 @example | |
19750 On 10/12/1999 5:49 PM Ben Wing wrote: | 21971 On 10/12/1999 5:49 PM Ben Wing wrote: |
19751 | 21972 |
19752 Subject: Re: hashtable-based marking and cleanups | 21973 Subject: Re: hashtable-based marking and cleanups |
19753 @end example | |
19754 | 21974 |
19755 OK, I can see the advantages. But: | 21975 OK, I can see the advantages. But: |
19756 | 21976 |
19757 @enumerate | 21977 @enumerate |
19758 @item | 21978 @item |
19793 | 22013 |
19794 @example | 22014 @example |
19795 http://www.amazon.com/exec/obidos/ASIN/0471941484/qid=939775572/sr=1-1/002-3092633-2509405 | 22015 http://www.amazon.com/exec/obidos/ASIN/0471941484/qid=939775572/sr=1-1/002-3092633-2509405 |
19796 @end example | 22016 @end example |
19797 | 22017 |
19798 @node Discussion -- glyphs, , Discussion -- garbage collection, Future Work Discussion | 22018 @node Discussion -- glyphs, Discussion -- Dialog Boxes, Discussion -- garbage collection, Future Work Discussion |
19799 @section Discussion -- glyphs | 22019 @section Discussion -- glyphs |
19800 @cindex discussion, glyphs | 22020 @cindex discussion, glyphs |
19801 @cindex glyphs, discussion | 22021 @cindex glyphs, discussion |
19802 | 22022 |
19803 Some comments (not always pretty!) by Ben: | 22023 Some comments (not always pretty!) by Ben: |
19804 | 22024 |
19805 @example | |
19806 March 20, 2000 | 22025 March 20, 2000 |
19807 | 22026 |
19808 Andy, I use the tab widgets but I've been having lots of problems. | 22027 Andy, I use the tab widgets but I've been having lots of problems. |
19809 | 22028 |
19810 1] Sometimes clicking on them does nothing. | 22029 1] Sometimes clicking on them does nothing. |
19815 to the front of the buffer list, like it should. It looks like you're | 22034 to the front of the buffer list, like it should. It looks like you're |
19816 doing this to avoid having the order of the tabs change, but this is | 22035 doing this to avoid having the order of the tabs change, but this is |
19817 wrong: If you don't reorder the buffer list, everything else gets | 22036 wrong: If you don't reorder the buffer list, everything else gets |
19818 screwed up. If you want the order of the tabs not to change, you need | 22037 screwed up. If you want the order of the tabs not to change, you need |
19819 to decouple this order from the buffer list order. | 22038 to decouple this order from the buffer list order. |
19820 @end example | 22039 |
19821 | |
19822 @example | |
19823 March 23, 2000 | 22040 March 23, 2000 |
19824 | 22041 |
19825 I'm very confused. The SIGIO timer is used @strong{only} for C-g. It has | 22042 I'm very confused. The SIGIO timer is used @strong{only} for C-g. It has |
19826 nothing to do with any other events. (sit-for 0) ought to | 22043 nothing to do with any other events. (sit-for 0) ought to |
19827 | 22044 |
19837 leery of introducing new Lisp functions to deal with specific problems. | 22054 leery of introducing new Lisp functions to deal with specific problems. |
19838 Pretty soon we end up with a whole bevy of such ill-defined functions, | 22055 Pretty soon we end up with a whole bevy of such ill-defined functions, |
19839 like we already have. I think instead, you should introduce the | 22056 like we already have. I think instead, you should introduce the |
19840 following primitive: | 22057 following primitive: |
19841 | 22058 |
22059 @example | |
19842 (wait-for-event redisplay &rest event-specs) | 22060 (wait-for-event redisplay &rest event-specs) |
22061 @end example | |
19843 | 22062 |
19844 Waits for one of the event specifications specified to happen. Returns | 22063 Waits for one of the event specifications specified to happen. Returns |
19845 something about what happened. | 22064 something about what happened. |
19846 | 22065 |
19847 REDISPLAY controls the behavior of redisplay during waiting. Something | 22066 REDISPLAY controls the behavior of redisplay during waiting. Something |
19848 like | 22067 like |
19849 | 22068 |
19850 - nil (never redisplay), | 22069 @itemize @bullet |
19851 - t (redisplay when it seems appropriate), etc. | 22070 @item |
22071 nil (never redisplay), | |
22072 @item | |
22073 t (redisplay when it seems appropriate), etc. | |
22074 @end itemize | |
19852 | 22075 |
19853 EVENT-SPECS could be | 22076 EVENT-SPECS could be |
19854 | 22077 |
22078 @example | |
19855 t -- drain all non-user events, and then return | 22079 t -- drain all non-user events, and then return |
19856 any-process -- wait till input or state change on any process | 22080 any-process -- wait till input or state change on any process |
19857 process -- wait till input or state change on process | 22081 process -- wait till input or state change on process |
19858 time -- wait till such-and-such time has elapsed | 22082 time -- wait till such-and-such time has elapsed |
19859 'user -- wait till user event has happened | 22083 'user -- wait till user event has happened |
19860 '(user predicate) -- wait till user event matching the predicate has | 22084 '(user predicate) -- wait till user event matching the predicate has |
19861 happened | 22085 happened |
19862 'event -- wait till any event has happened | 22086 'event -- wait till any event has happened |
19863 '(event predicate) -- wait till event matching the predicate has happened | 22087 '(event predicate) -- wait till event matching the predicate has happened |
22088 @end example | |
19864 | 22089 |
19865 The existing functions @code{next-event}, @code{next-command-event}, | 22090 The existing functions @code{next-event}, @code{next-command-event}, |
19866 @code{accept-process-output}, @code{sit-for}, @code{sleep-for}, etc. could all be | 22091 @code{accept-process-output}, @code{sit-for}, @code{sleep-for}, etc. could all be |
19867 written in terms of this new command. You could use this command inside | 22092 written in terms of this new command. You could use this command inside |
19868 of your glyph code to ensure that the events get processed that need to | 22093 of your glyph code to ensure that the events get processed that need to |
19869 in order for widget updates to happen. | 22094 in order for widget updates to happen. |
19870 | 22095 |
19871 But you said something about needing a magic event to invoke redisplay? | 22096 But you said something about needing a magic event to invoke redisplay? |
19872 Why is that? | 22097 Why is that? |
19873 @end example | 22098 |
19874 | |
19875 @example | |
19876 April 2, 2000 | 22099 April 2, 2000 |
19877 | 22100 |
19878 the internal distinction between "widget" and "layout" is bogus. there | 22101 the internal distinction between "widget" and "layout" is bogus. there |
19879 exist widgets that do drawing and do layout of their children, | 22102 exist widgets that do drawing and do layout of their children, |
19880 e.g. group-box widgets and proper tab widgets. the only sensible | 22103 e.g. group-box widgets and proper tab widgets. the only sensible |
19881 distinction is between widgets with children and those without children. | 22104 distinction is between widgets with children and those without children. |
19882 @end example | 22105 |
19883 | |
19884 @example | |
19885 April 5, 2000 | 22106 April 5, 2000 |
19886 | 22107 |
19887 andy, i'm not sure i really believe that you need to cycle the event | 22108 andy, i'm not sure i really believe that you need to cycle the event |
19888 code to get widgets to redisplay, but in any case you should | 22109 code to get widgets to redisplay, but in any case you should |
19889 | 22110 |
19898 @end enumerate | 22119 @end enumerate |
19899 | 22120 |
19900 in other words, dispatch-non-command-events must go, and i am proposing | 22121 in other words, dispatch-non-command-events must go, and i am proposing |
19901 a general function (redisplay OBJECT) to replace the existing ad-hoc | 22122 a general function (redisplay OBJECT) to replace the existing ad-hoc |
19902 functions. | 22123 functions. |
19903 @end example | 22124 |
19904 | |
19905 @example | |
19906 April 6, 2000 | 22125 April 6, 2000 |
19907 | 22126 |
19908 the tab widget code should simply be able to create a whole lot of tabs | 22127 the tab widget code should simply be able to create a whole lot of tabs |
19909 without regard to the size of the gutter, and the surrounding layout | 22128 without regard to the size of the gutter, and the surrounding layout |
19910 widget (please please make layouts be proper widgets!) should | 22129 widget (please please make layouts be proper widgets!) should |
19911 automatically map and unmap them as necessary, to fill up the available | 22130 automatically map and unmap them as necessary, to fill up the available |
19912 space. perhaps this already works and what you're doing is just for | 22131 space. perhaps this already works and what you're doing is just for |
19913 optimization? but i get the feeling this is not the case. | 22132 optimization? but i get the feeling this is not the case. |
19914 @end example | 22133 |
19915 | |
19916 @example | |
19917 April 6, 2000 | 22134 April 6, 2000 |
19918 | 22135 |
19919 the function make-gutter-only-dialog-frame is bogus. the use of the | 22136 the function make-gutter-only-dialog-frame is bogus. the use of the |
19920 gutter here to hold widgets is an implementation detail and should not | 22137 gutter here to hold widgets is an implementation detail and should not |
19921 be exposed in the interface. similarly, make-search-dialog should not | 22138 be exposed in the interface. similarly, make-search-dialog should not |
19924 hidden. you should have a simple function make-dialog-frame that takes | 22141 hidden. you should have a simple function make-dialog-frame that takes |
19925 a dialog specification, and that's all you need to do. | 22142 a dialog specification, and that's all you need to do. |
19926 | 22143 |
19927 also, these dialog boxes, and this function make-dialog-frame, should | 22144 also, these dialog boxes, and this function make-dialog-frame, should |
19928 | 22145 |
19929 a] be in dialog.el, not gutter-items.el. | 22146 @enumerate |
19930 b] when possible, be placed in the interactive spec of standard lisp | 22147 @item |
19931 functions rather than accessed directly from menubar-items.el | 22148 be in @file{dialog.el}, not gutter-items.el. |
19932 c] wrapped in calls to should-use-dialog-box-p, so the user has control | 22149 @item |
22150 when possible, be placed in the interactive spec of standard lisp | |
22151 functions rather than accessed directly from @file{menubar-items.el} | |
22152 @item | |
22153 wrapped in calls to should-use-dialog-box-p, so the user has control | |
19933 over when dialog boxes appear. | 22154 over when dialog boxes appear. |
19934 @end example | 22155 @end enumerate |
19935 | 22156 |
19936 @example | |
19937 April 7, 2000 | 22157 April 7, 2000 |
19938 | 22158 |
19939 hmmm ... in that case, the whitespace absolutely needs to be specified | 22159 hmmm ... in that case, the whitespace absolutely needs to be specified |
19940 as properties of the layout widget (e.g. :border-width and | 22160 as properties of the layout widget (e.g. :border-width and |
19941 :border-height), rather than setting an overall size. you have no idea | 22161 :border-height), rather than setting an overall size. you have no idea |
19942 what the correct size should be if the user changes font size or uses | 22162 what the correct size should be if the user changes font size or uses |
19943 translations in a different language. | 22163 translations in a different language. |
19944 | 22164 |
19945 Your modus operandi should be "hardcoded pixel sizes are @strong{always} bad." | 22165 Your modus operandi should be "hardcoded pixel sizes are @strong{always} bad." |
19946 @end example | 22166 |
19947 | |
19948 @example | |
19949 April 7, 2000 | 22167 April 7, 2000 |
19950 | 22168 |
19951 you mean the number of tabs adjusts, or the size of each tab adjusts (by | 22169 you mean the number of tabs adjusts, or the size of each tab adjusts (by |
19952 making the font smaller or something)? if the size of a single tab is | 22170 making the font smaller or something)? if the size of a single tab is |
19953 not related to the total space the tabs can fit into, then it should be | 22171 not related to the total space the tabs can fit into, then it should be |
19964 a maximum width (which should be done in 'n' sizes, not in pixels!). | 22182 a maximum width (which should be done in 'n' sizes, not in pixels!). |
19965 | 22183 |
19966 i won't stop complaining until i see nearly every one of those | 22184 i won't stop complaining until i see nearly every one of those |
19967 pixel-width and pixel-height parameters gone, and the remaining ones | 22185 pixel-width and pixel-height parameters gone, and the remaining ones |
19968 there for a very, very good reason. | 22186 there for a very, very good reason. |
19969 @end example | 22187 |
22188 April 7, 2000 | |
22189 | |
22190 Andy Piper wrote: | |
19970 | 22191 |
19971 @example | 22192 @example |
19972 April 7, 2000 | |
19973 | |
19974 Andy Piper wrote: | |
19975 | |
19976 > At 03:51 PM 4/6/00 -0700, Ben Wing wrote: | 22193 > At 03:51 PM 4/6/00 -0700, Ben Wing wrote: |
19977 > >[the function make-gutter-only-dialog-frame is bogus] | 22194 > >[the function make-gutter-only-dialog-frame is bogus] |
19978 > | 22195 > |
19979 > The problem is that some of the callbacks and such need access to the | 22196 > The problem is that some of the callbacks and such need access to the |
19980 > @strong{created} frame, so you end up in a catch 22 unless you do what I've done. | 22197 > @strong{created} frame, so you end up in a catch 22 unless you do what I've done. |
22198 @end example | |
19981 | 22199 |
19982 [Ben proposes other ways to avoid exposing all the guts, as in | 22200 [Ben proposes other ways to avoid exposing all the guts, as in |
19983 @code{make-gutter-only-dialog-frame}:] | 22201 @code{make-gutter-only-dialog-frame}:] |
19984 | 22202 |
19985 @enumerate | 22203 @enumerate |
19998 (depending on where the glyph is) where the invocation actually | 22216 (depending on where the glyph is) where the invocation actually |
19999 happened. That way, the callbacks can easily figure out the dialog | 22217 happened. That way, the callbacks can easily figure out the dialog |
20000 box and its parent, and not have to worry about embedding it in at | 22218 box and its parent, and not have to worry about embedding it in at |
20001 creation time. | 22219 creation time. |
20002 @end enumerate | 22220 @end enumerate |
20003 @end example | 22221 |
20004 | |
20005 @example | |
20006 April 15, 2000 | 22222 April 15, 2000 |
20007 I don't understand when you say "the various types of callback". Are | 22223 I don't understand when you say "the various types of callback". Are |
20008 you using the callback for various different purposes? | 22224 you using the callback for various different purposes? |
20009 | 22225 |
20010 Your widget callbacks should work just like any other callback: they | 22226 Your widget callbacks should work just like any other callback: they |
20011 take two arguments, one indicating the object to which the callback was | 22227 take two arguments, one indicating the object to which the callback was |
20012 attached (an image instance, i think), and the event that caused the | 22228 attached (an image instance, i think), and the event that caused the |
20013 callback to be invoked. | 22229 callback to be invoked. |
20014 @end example | 22230 |
20015 | |
20016 @example | |
20017 April 17, 2000 | 22231 April 17, 2000 |
20018 | 22232 |
20019 I am completely vetoing widget-callback-current-channel. How about you | 22233 I am completely vetoing widget-callback-current-channel. How about you |
20020 create a new keyword, :new-callback, that is a function of two args, | 22234 create a new keyword, :new-callback, that is a function of two args, |
20021 like i specified before. | 22235 like i specified before. |
20026 result as widget-callback-current-channel. | 22240 result as widget-callback-current-channel. |
20027 | 22241 |
20028 the problem with this and everything you've proposed is that there's no | 22242 the problem with this and everything you've proposed is that there's no |
20029 way, of course, to get at the actual widget that you were invoked from. | 22243 way, of course, to get at the actual widget that you were invoked from. |
20030 would you propose adding widget-callback-current-widget? | 22244 would you propose adding widget-callback-current-widget? |
22245 | |
22246 @node Discussion -- Dialog Boxes, Discussion -- Multilingual Issues, Discussion -- glyphs, Future Work Discussion | |
22247 @section Discussion -- Dialog Boxes | |
22248 @cindex discussion, dialog boxes | |
22249 @cindex dialog boxes, discussion | |
22250 | |
22251 @example | |
22252 From: | |
22253 Ben Wing <ben@@666.com> | |
22254 10/7/1999 5:57 PM | |
22255 | |
22256 Subject: | |
22257 Re: Animated gif patch (2) | |
22258 To: | |
22259 Andy Piper <andy@@xemacs.org> | |
22260 CC: | |
22261 xemacs-review@@xemacs.org, xemacs-beta@@xemacs.org | |
22262 | |
22263 | |
22264 | |
22265 | |
22266 The distinction between layouts and widgets makes no sense, so you should combine | |
22267 the different data required. Consider a grouping widget. Is this a layout or a | |
22268 widget? It draws, like a widget, but has children, like a layout. Same for a tab | |
22269 widget, properly implemented. It draws, handles input, has children, and makes | |
22270 choices about how to lay them out. | |
22271 | |
22272 ben | |
22273 | |
22274 From: | |
22275 Ben Wing <ben@@666.com> | |
22276 9/7/1999 8:50 PM | |
22277 | |
22278 Subject: | |
22279 Re: Layouts done | |
22280 To: | |
22281 Andy Piper <andyp@@beasys.com> | |
22282 | |
22283 | |
22284 | |
22285 | |
22286 this sounds great! where can i see the code? | |
22287 | |
22288 as for user-defined layouts, you must certainly have some sort of abstraction | |
22289 layer for layouts, with DEFINE_LAYOUT_TYPE or something similar just like device | |
22290 types and such. If not, you should certainly make one ... it would have methods | |
22291 such as query-geometry and do-layout. It should be easy to create a user-defined | |
22292 layout if you have such an abstraction. | |
22293 | |
22294 with a user-defined layout, complex built-in layouts such as grid should not be | |
22295 necessary because it's so easy to write snippets of lisp. | |
22296 | |
22297 as for the "redisplay too much" problem, perhaps you could put a dirty flag in | |
22298 each glyph indicating whether it needs to be redisplayed, recalculated, etc.? | |
22299 | |
22300 Andy Piper wrote: | |
22301 | |
22302 > You may want to check them out. I haven't done the user-defined layout | |
22303 > callback - I'm not sure what sort of API this could have. Keywords I've done: | |
22304 > | |
22305 > :orientation - vertical or horizontal | |
22306 > :justify - left, center or right | |
22307 > :border - etch-in, etch-out, bevel-in, bevel-out or text (which gives you | |
22308 > etch-in with a title) | |
22309 > | |
22310 > You can embed any glyph type in a layout. | |
22311 > | |
22312 > There is probably room for improvements for justify to do grid-type layouts | |
22313 > as per java. | |
22314 > | |
22315 > The only annoying thing is that I've hacked up font-lock support to do a | |
22316 > progress gauge in the gutter area. I've used a layout to set things out | |
22317 > correctly. The problem is if you change one of the sub-widgets, the whole | |
22318 > layout gets redisplayed because it is treated as a single glyph by redisplay. | |
22319 > | |
22320 > Oh, and I've done line based scrolling so that glyphs scroll off the page | |
22321 > in units of the average display line height rather than the whole line at | |
22322 > once. This could easily be converted to pixel scrolling but would be very | |
22323 > slow I fear. | |
22324 > | |
22325 > andy | |
22326 > -------------------------------------------------------------- | |
22327 > Dr Andy Piper | |
22328 > Senior Consultant Architect, BEA Systems Ltd | |
22329 | |
22330 | |
22331 | |
22332 | |
22333 From: | |
22334 Ben Wing <ben@@666.com> | |
22335 8/10/1999 11:11 PM | |
22336 | |
22337 Subject: | |
22338 Re: Widgets | |
22339 To: | |
22340 Andy Piper <andy@@xemacs.org> | |
22341 | |
22342 | |
22343 | |
22344 | |
22345 I think you might have misinterpreted what i meant. I meant to say that XEmacs should | |
22346 implement the @strong{concept} of a hierarchy of nested child "widgets" or "gui items" or | |
22347 whatever we want to call them -- this includes container "widgets" such as grouping | |
22348 widgets (which draw a border around the children, like in Windows), tab widgets, simple | |
22349 layout widgets (invisible, but lay out their children appropriately), etc, plus leaf | |
22350 "widgets" (buttons, sliders, etc., also standard Emacs windows). The layout calculations | |
22351 for these widgets would be handled entirely by XEmacs in a window-system-independent way. | |
22352 There is no need to create a corresponding hierarchy of window-system | |
22353 widgets/controls/whatever if it's not required, and certainly no need to try to use the | |
22354 window-system-supplied geometry management routines. It's absolutely necessary to support | |
22355 this nesting concept in XEmacs, however, or it's impossible to have easily-designable | |
22356 dialog boxes. On the other hand, I think it @strong{is} required to create much of this | |
22357 hierarchy within the actual window system, at the very least for non-invisible container | |
22358 widgets (tab, grouping, etc.), otherwise we will have very bogus, non-native-looking | |
22359 containers like your current tab-widget implementation. It's critical for XEmacs to be | |
22360 able to create dialog boxes in Windows or Motif that look just like those in any other | |
22361 standard application. Otherwise people will continue to think that XEmacs is a | |
22362 backwards-looking, badly implemented piece of software, which in many ways it is, | |
22363 particularly in regards to its user interface. | |
22364 | |
22365 Perhaps we should talk on the phone? This typing is quite hard for me still. What hours | |
22366 are you at work? My hours are approx. 2pm - 2am Pacific time (GMT - 7 hours currently). | |
22367 | |
22368 ben | |
22369 | |
22370 | |
22371 From: | |
22372 Ben Wing <ben@@666.com> | |
22373 7/21/1999 2:44 AM | |
22374 | |
22375 Subject: | |
22376 Re: Tabs 'n widgets screenshot | |
22377 To: | |
22378 Andy Piper <andy@@xemacs.org> | |
22379 CC: | |
22380 xemacs-beta@@xemacs.org, wmperry@@aventail.com | |
22381 | |
22382 | |
22383 | |
22384 | |
22385 This is real cool, but looking at this, it's clear that it doesn't look the | |
22386 way tab widgets are supposed to work. In particular, of course, they should | |
22387 have the proper borders around the stuff displayed. I've attached a screen | |
22388 shot of a typical Windows dialog box with a tab widget in it. The problem | |
22389 lies with this "expanded gutter" concept. Tabs are @strong{NOT} extra graphical junk | |
22390 placed in the gutters of a buffer but are GUI objects with @strong{children} inside | |
22391 of them. This is the right way to do things, and you would need no extra | |
22392 gutter functionality at all for this. You just need to implement the concept | |
22393 of GUI objects containing other GUI objects within them. One such GUI object | |
22394 needs to be an "Emacs-text" GUI object, which is an Emacs window and contains a | |
22395 buffer within it. At this level, you need not be concerned with the | |
22396 complexities of geometry layout. The only change that needs to be made in the | |
22397 overall strategy of frames, windows, etc. is that windows need not be exactly | |
22398 contiguous and tiled, as long as they are contained within a frame. Or more | |
22399 specifically: Given that you could always split a window contained inside a | |
22400 GUI object, we just need to expand things so that each frame has @strong{multiple} | |
22401 hierarchies of windows in it, rather than just one. A hierarchy of windows | |
22402 can nest inside of another window -- e.g. I put a tab widget or a text widget | |
22403 inside of a buffer. This should be easy to implement -- just change things so | |
22404 there are multiple hierarchies of windows where there is currently one, each (except | |
22405 the top-level one) being rooted inside some other window. | |
22406 | |
22407 Anyone willing to implement this? Andy? | |
22408 | |
22409 | |
22410 From: | |
22411 Ben Wing <ben@@666.com> | |
22412 6/30/1999 3:30 PM | |
22413 | |
22414 Subject: | |
22415 Re: Focus Help! | |
22416 To: | |
22417 Andy Piper <andy@@xemacs.org> | |
22418 CC: | |
22419 Ben Wing <ben@@xemacs.org>, martin@@xemacs.org, andyp@@beasys.com | |
22420 | |
22421 | |
22422 | |
22423 | |
22424 It sounds like you're doing very good work. It also sounds like the approach | |
22425 you have followed is the correct one. Now, it seems like there isn't really | |
22426 that much work left to get dialog boxes working. What you really just need to | |
22427 do is implement container widgets, that is to say, subwindows that can contain | |
22428 other subwindows. For example, the tab widget works this way. (It sounds like | |
22429 you have already implemented tab widgets, so I don't quite see how you've done | |
22430 this without the concept of container widgets.) So you might just try adding a | |
22431 framework for container widgets and then implementing very simple container | |
22432 widgets. The basic container widgets are: | |
22433 | |
22434 1. A vertical-layout widget, which draws nothing itself and lays out its | |
22435 children one above the next. | |
22436 2. A horizontal-layout widget, which draws nothing itself and lays out its | |
22437 children side-to-side. | |
22438 3. A box (or "grouping") widget, which draws a rectangle around its single child | |
22439 and optionally draws some text on the top or bottom line of the rectangle. | |
22440 4. A tab widget, which displays a series of tabs horizontally at the top of its | |
22441 area, and then below it places one of its children, | |
22442 corresponding to the selected tab. | |
22443 5. A user widget, which draws nothing itself and does no layout at all on its | |
22444 children, except that it has a "layout callback" | |
22445 property, a Lisp function, so that the programmer can control the layout. | |
22446 | |
22447 The framework is as follows: | |
22448 | |
22449 1. Every widget has at least the following properties: | |
22450 a) a size, whose value can be "unspecified", which might be implemented | |
22451 using the value -1. The default value should be "unspecified". | |
22452 b) whether it's mapped, i.e. whether it will be displayed. (Some container | |
22453 widgets, such as the tab widget, set the mapped | |
22454 property themselves on their children. Others, such as the vertical and | |
22455 horizontal layout widgets, don't change this property but pay attention to it, | |
22456 and ignore completely all children marked as unmapped.) The default value should | |
22457 be "true". | |
22458 c) whether its size can be changed by another widget's layout routine. The | |
22459 default value should be "true". | |
22460 d) a layout procedure, which (potentially at least) determines the size of | |
22461 the widget as well as the position, size and mappedness of its child widgets. | |
22462 The layout procedure is inherent in the widget and is not an external property | |
22463 of the widget (except in the case of the "user widget"): it is instead more like | |
22464 the redisplay callback that each widget has. | |
22465 2. Every container widget contains a property which is a list of child widgets. | |
22466 3. Every child widget contains the following properties: | |
22467 a) a position indicating where the child is located relative to the top | |
22468 left corner of its parent. The position's value can be "unspecified", which | |
22469 might be implemented using the value -1. The default value should be | |
22470 "unspecified". | |
22471 b) whether its position can be changed by another widget's layout routine. | |
22472 The default value should be "true". | |
22473 4. All of the properties just listed (except possibly the layout procedure) can | |
22474 be modified directly by the programmer, and there are no proscriptions against | |
22475 doing so. However, if the programmer wants to resize, reposition, map or unmap | |
22476 a widget in such a way that the layout of all the other widgets in the tree | |
22477 changes appropriately, he should use a special function to change the property, | |
22478 as described below. | |
22479 | |
22480 The redisplay mechanism pays attention to the position, size, and mappedness | |
22481 properties and to the hierarchy of widgets, mapping, resizing and repositioning | |
22482 the corresponding subwindows (the "real representation" of the widgets) as | |
22483 necessary. It also pays attention to the hierarchy of the widgets, making sure | |
22484 that container subwindows get drawn before their child subwindows. When it | |
22485 encounters widgets with an unspecified size, it should not draw them, and should | |
22486 issue a warning. When it encounters widgets with an unspecified position, it | |
22487 should draw them at position (0, 0) and should issue a warning. | |
22488 | |
22489 The above framework should be fairly simple to implement and is basically | |
22490 universal across all high-level windowing system toolkits. The stickiness comes | |
22491 with what procedures you follow for getting the layout done. | |
22492 | |
22493 Andy, I understand that implementing this may seem like a daunting task. | |
22494 Therefore, I propose that at first you implement the above framework but don't | |
22495 implement any of the layout procedures, or any of the functions that call them: | |
22496 Just make them stubs that do nothing. This way, the Lisp programmer can still | |
22497 create any dialog boxes he wants, he just has to set the sizes and positions of | |
22498 all the widgets explicitly, and then recompute them whenever the widget tree is | |
22499 resized (once you get around to allowing this). I have a lot more to write | |
22500 about exactly how the layout procedures work, but I'll send that to you later | |
22501 once you're ready. | |
22502 | |
22503 You should also think about making a way to have widget trees as top-level | |
22504 windows rather than just glyphs in a buffer. There's already the concept of | |
22505 "popup" frames. You could provide an easy way to create a popup frame with no | |
22506 menu, toolbars, scrollbars, modeline or minibuffer, and put a single glyph in | |
22507 the displayed buffer that takes up the whole Emacs window. | |
22508 | |
22509 Ben | |
22510 | |
22511 | |
22512 | |
22513 | |
22514 March 20, 2000 | |
22515 | |
22516 You wrote to me awhile ago about this and asked about documentation, and I | |
22517 dictated a response but never got it sent, so here it is: | |
22518 | |
22519 I don't think there's any more documentation on how things work under Xt but it | |
22520 should be clear. The EmacsFrame widget is the widget corresponding to the X | |
22521 window that Emacs draws into and there is a handler for expose events called | |
22522 from Xt which arranges for the invalidated areas to get redrawn. I think this | |
22523 used to happen as part of the handler itself but now it is delayed until the | |
22524 next call to redisplay. | |
22525 | |
22526 However, one thing that you absolutely must not do is remove the Xt support. | |
22527 This would be an incredibly unfriendly thing to do as it would prevent people | |
22528 from using any widget set other than Qt or GTK. Keep in mind that people run | |
22529 XEmacs on all sorts of different versions of X in Unix, and Xt is the standard | |
22530 and the only toolkit that probably exists on all of these systems. | |
22531 | |
22532 Pardon me if I've misunderstood your intentions w.r.t. this. | |
22533 | |
22534 As for how you would implement GTK support, it will not be very hard to convert | |
22535 redisplay to draw into a GTK window instead of an Xt window. In fact redisplay | |
22536 basically doesn't know about Xt at all, except in the portion that handles | |
22537 updating menubars and scrollbars and stuff that's directly related to Xt. | |
22538 | |
22539 What you'd probably want to do is create a new set of event routines to replace | |
22540 the ones in event-Xt.c. On the display side you could conceivably create a new | |
22541 device type but you probably wouldn't want to do that because it would be an | |
22542 externally visible change at the Lisp level. You might simply want to put a | |
22543 flag on each frame indicating what sort of toolkit the frame was created under | |
22544 and put conditions in the redisplay code and the code to update toolbars and | |
22545 menubars and so forth to test this flag and do the appropriate thing. | |
22546 | |
22547 | |
22548 April 12, 2000 | |
22549 | |
22550 This is way cool, buuuuutttttttt ............. | |
22551 | |
22552 what we @strong{really} need is the GUI interface on top of it. I've taken a shot at | |
22553 it with generic-print-buffer | |
22554 (print-buffer is taken by lpr, which is such a total mess that it needs to be | |
22555 trashed; or at least, the generic | |
22556 stuff in this package needs to be taken out and properly genericized). For | |
22557 the moment, generic-print-buffer | |
22558 just does something like what Kirill's been posting if we're running windows, | |
22559 and uses lpr otherwise. However, what we absofuckinglutely need is a Lisp | |
22560 interface onto @code{EnumPrinters()} so that we can get the | |
22561 list of printers and have a nice menu listing the available printers, and you | |
22562 can check the one you want. People in the Windows world don't normally even | |
22563 know the names of their local printers! | |
22564 | |
22565 Kirill, given what I've done in @file{simple.el} and @file{menubar-items.el}, do you think | |
22566 you could add the @code{EnumPrinters()} | |
22567 support and fix up the GUI? If you don't feel comfortable with the GUI, at | |
22568 least do the @code{EnumPrinters()}. | |
22569 | |
22570 But ... Kirill, I tried your formula for printing and nothing happened. | |
22571 Perhaps I didn't call redisplay-frame or something? You need to fix this up | |
22572 and make it work for multi-page documents. (Again, this is in | |
22573 generic-print-buffer.) Nothing special, it just needs to fucking work! There | |
22574 are zillions and zillions of postings every day on xemacs-nt about how to get | |
22575 printing working, and none seem to refer to the built-in support. | |
22576 | |
22577 ben | |
22578 | |
22579 | |
22580 April 19, 2000 | |
22581 | |
22582 Kirill 'Big K' Katsnelson wrote: | |
22583 | |
22584 > Some time ago, Ben Wing wrote... | |
22585 > >kirill, the interface i created is more general, like this: | |
22586 > | |
22587 > [snip] | |
22588 > | |
22589 > >Unfortunately I haven't implemented much of this; just some of the file | |
22590 > >dialog box. but i think | |
22591 > >this is better than creating new mswindows-specific primitives. if you | |
22592 > >are interested in working on | |
22593 > >this, i'll send you the code i have. | |
22594 > | |
22595 > Sure. Can you just commit it for my starting point? | |
22596 > | |
22597 > >also, the dialogs shouldn't have anything directly to do with the printer | |
22598 > >device. all they should | |
22599 > >do is return a set of values. it's the caller's responsibility to | |
22600 > >interpret them and set device | |
22601 > >properties accordingly. this way, there's a complete separation between | |
22602 > >the underlying | |
22603 > >functionality and the gui. | |
22604 > | |
22605 > Unfortunately. I thought about doing it this way, but we then lose a lot of | |
22606 > printer-specific setup in this case. The DEVMODE structure contains two | |
22607 > parts: printer independent, as defined by SDK typedef DEVMODE, and | |
22608 > some trailing bytes, of unknown structure, used by a driver. The driver | |
22609 > only returns the extra length it wants. Such options as PCL ReT resolution | |
22610 > enhancement options or PostScript negative output are not available | |
22611 > through the standard part of the devmode structure, and stored in the | |
22612 > driver part (printer dialogs are driver-specific). | |
22613 > | |
22614 > So we have a total of three options: | |
22615 > - Not to implement options beyond standard DEVMODE | |
22616 > - Make DEVMODE a Lisp object. | |
22617 > - Hide DEVMODE inside the device object. | |
22618 > | |
22619 > First case looks cheesy. Letting DEVMODE fall off the printer is no good | |
22620 > either, since one needs both the device and the devmode to edit the | |
22621 > devmode, and they must match. I am still convinced that the devmode and | |
22622 > the printer should not be separated. | |
22623 | |
22624 hmm, i see ... this completely breaks abstraction though. it fails in various | |
22625 scenarios, e.g. a program wants to initialize the dialog box with certain | |
22626 non-driver-specific properties, without caring about the particular printer. | |
22627 | |
22628 i think you should create a new print-properties object that encapsulates all | |
22629 printer properties (which can be changed using get/put), including the printer | |
22630 name, and contains a DEVMODE in it. if the printer name gets changed, the | |
22631 DEVMODE might change too, but the print-properties object itself stays the | |
22632 same. you pass this object as a parameter to the dialog box, and it gets | |
22633 changed accordingly. you can call something like set-device-print-properties to | |
22634 stick everything in this structure into the device. (you could imagine a case | |
22635 where someone wanted to keep multiple print configurations around ...) | |
22636 | |
22637 > | |
22638 > | |
22639 > Big K | |
22640 | |
22641 -- | |
22642 Ben | |
22643 | |
22644 @end example | |
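
The container-widget framework proposed in the 6/30/1999 message above (every
widget has a size and a mapped flag, every child has a position, "unspecified"
is represented as -1, and a vertical-layout widget stacks its mapped children)
can be illustrated with a minimal sketch. This is not XEmacs code: the
@code{struct widget} type, its field names, the fixed-size child array and the
functions @code{widget_init}, @code{widget_add_child} and
@code{vertical_layout} are all hypothetical names chosen for illustration, and
the choice to skip children whose size is still unspecified is an assumption
based on the message's remark that redisplay should not draw such widgets.

```c
#include <assert.h>
#include <stddef.h>

#define UNSPECIFIED (-1)   /* the message suggests -1 for "unspecified" */
#define MAX_CHILDREN 8     /* arbitrary limit, for the sketch only */

struct widget;
typedef void (*layout_proc) (struct widget *);

struct widget
{
  int width, height;        /* size; UNSPECIFIED by default */
  int x, y;                 /* position relative to the parent's top left
                               corner; UNSPECIFIED by default */
  int mapped;               /* nonzero if the widget is to be displayed */
  int size_changeable;      /* may another widget's layout resize it?
                               (kept for completeness; unused below) */
  int position_changeable;  /* may another widget's layout move it? */
  layout_proc layout;       /* layout procedure; NULL for leaf widgets */
  struct widget *children[MAX_CHILDREN];
  int nchildren;
};

/* Give a widget the default property values listed in the message.  */
static void
widget_init (struct widget *w, layout_proc layout)
{
  int i;

  w->width = w->height = UNSPECIFIED;
  w->x = w->y = UNSPECIFIED;
  w->mapped = 1;
  w->size_changeable = 1;
  w->position_changeable = 1;
  w->layout = layout;
  w->nchildren = 0;
  for (i = 0; i < MAX_CHILDREN; i++)
    w->children[i] = NULL;
}

static void
widget_add_child (struct widget *parent, struct widget *child)
{
  assert (parent->nchildren < MAX_CHILDREN);
  parent->children[parent->nchildren++] = child;
}

/* Vertical-layout procedure: draws nothing, places mapped children one
   above the next, and ignores unmapped children completely.  */
static void
vertical_layout (struct widget *w)
{
  int i, y = 0, max_width = 0;

  for (i = 0; i < w->nchildren; i++)
    {
      struct widget *c = w->children[i];

      if (!c->mapped)
        continue;
      if (c->layout)
        c->layout (c);          /* let nested containers size themselves */
      if (c->width == UNSPECIFIED || c->height == UNSPECIFIED)
        continue;               /* assumption: cannot place an unsized child */
      if (c->position_changeable)
        {
          c->x = 0;
          c->y = y;
        }
      y += c->height;
      if (c->width > max_width)
        max_width = c->width;
    }

  /* Only fill in the container's own size where none was given.  */
  if (w->width == UNSPECIFIED)
    w->width = max_width;
  if (w->height == UNSPECIFIED)
    w->height = y;
}
```

A caller would initialize a container with @code{vertical_layout}, give each
leaf widget an explicit size, add the leaves with @code{widget_add_child}, and
then invoke the container's layout procedure. A horizontal-layout widget would
differ only in accumulating widths instead of heights.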
22645 | |
22646 @node Discussion -- Multilingual Issues, Discussion -- Windows External Widget, Discussion -- Dialog Boxes, Future Work Discussion | |
22647 @section Discussion -- Multilingual Issues | |
22648 @cindex discussion, multilingual issues | |
22649 @cindex multilingual issues, discussion | |
22650 | |
22651 @example | |
22652 | |
22653 4/10/2000 4:13 AM | |
22654 | |
22655 BTW I am planning on adding some more powerful font-mapping capabilities to | |
22656 XEmacs (i.e. how do we map particular characters to the proper fonts that can | |
22657 display them, and how do we map the character's codes to the indices into the | |
22658 font). These will replace the hackish charset-registry/charset-ccl-program stuff | |
22659 we currently have, and be [a] much more powerful, [b] designed in a | |
22660 window-system-independent way, [c] able to work with specifiers so you can control the | |
22661 mapping of individual buffers, and [d] able to work on a character rather than charset | |
22662 level, to correctly handle Unicode. One possible usage would be to declare that | |
22663 all latin1 in a particular buffer be displayed with latin2 fonts; I bet | |
22664 Hrvoje would really appreciate that. | |
22665 | |
22666 --------------------------------------------------------------------------- | |
22667 | |
22668 April 10, 2000 | |
22669 | |
22670 [info from "creation of generic macros for accessing internally formatted data"] | |
22671 | |
22672 Hmm, so there I just wrote a detailed design for the macros. I would be | |
22673 @strong{THRILLED} and overjoyed if you went ahead and implemented this mechanism, or | |
22674 parts of it. | |
22675 | |
22676 I've just finished arranging for a new transcriptionist, and soon I should be | |
22677 able to send off and get back my dictation of my (a) exposing streams to lisp, | |
22678 and (b) allowing for proper lisp-created coding systems, which define their | |
22679 reading, writing, and detecting methods in lisp. | |
22680 | |
22681 | |
22682 BTW How's it going wrt your Unicode and decode-priority stuff? | |
22683 | |
22684 And ... you sent me mail asking what it was you had promised me, and listed | |
22685 only one thing, which was | |
22686 profiling of vm and certain other operations you found showed tremendous | |
22687 slowdown with Japanese characters. The other main thing I want from you is | |
22688 | |
22689 -- Your priorities, as an actual Japanese user and XEmacs developer, | |
22690 concerning what MULE work should be done, how it should be done, in what | |
22691 order, etc. | |
22692 | |
22693 I'm sure there's something else, but it's been awhile since I took my sleeping | |
22694 dose and my brain can barely function anymore. Just let me know how you're | |
22695 going to proceed with the above macro changes. | |
22696 | |
22697 BTW there's some nice Perl scripts written by Martin and fixed by me to make | |
22698 global-search-and-replace | |
22699 much, much easier. I've attached them. The first one is a shell script that | |
22700 works like | |
22701 | |
22702 gr foo bar *.[ch] | |
22703 | |
22704 and replaces foo with bar in all of the files. For each modified file, a | |
22705 backup is created in the backup/ directory, which is created as necessary. | |
22706 This shell script is a fairly trivial front end onto global-replace2, which is | |
22707 a perl script that takes one argument (a Perl expression such as s/foo/bar/g) | |
22708 and a list of files obtained by reading the stdin, and does the same global | |
22709 replacement. This means that the regexp syntax used here has to be perl-style | |
22710 rather than standard emacs/grep style. | |
22711 | |
22712 ben | |
22713 | |
22714 --------------------------------------------------------------------- | |
22715 | |
22716 | |
22717 From: | |
22718 Ben Wing <ben@@666.com> | |
22719 12/23/1999 3:34 AM | |
22720 | |
22721 Subject: | |
22722 Re: check process state before accessing coding_stream (fix PR#1061) | |
22723 To: | |
22724 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp> | |
22725 CC: | |
22726 XEmacs Developers <xemacs-beta@@xemacs.org> | |
22727 | |
22728 | |
22729 | |
22730 | |
22731 Thankfully, nearly all of this horridity you bring up is irrelevant. In | |
22732 XEmacs, "gettext" does not refer to any standard API, but is merely a stand-in | |
22733 for a translation routine (presumably written by us). We may as well call it | |
22734 something else. We define our own concept of "current language". We also | |
22735 allow for a function that needs a different version for each language, which | |
22736 handles all cases where simple translation isn't sufficient, e.g. when you | |
22737 have to pluralize some noun given to you or insert the correct form of the | |
22738 definite article. No weird hacks needed. No interaction problems with other | |
22739 pieces of software. | |
22740 | |
22741 What I wrote "awhile ago" is (unfortunately) not anywhere public currently, | |
22742 but it's on my list to put it on the web site. "There you go again" is | |
22743 usually not true; most of what I quote was indeed put out publicly at some | |
22744 point, but I'll try to be more explicit about this in the future. | |
22745 | |
22746 ben | |
22747 | |
22748 "Stephen J. Turnbull" wrote: | |
22749 | |
22750 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes: | |
22751 > | |
22752 > Ben> "Stephen J. Turnbull" wrote: | |
22753 > | |
22754 > >> What I have in mind is not just gettext-izing everything in the | |
22755 > >> XEmacs core sources. I currently believe that to be | |
22756 > >> unacceptable | |
22757 > | |
22758 > Ben> I don't quite understand. Could you elaborate and give some | |
22759 > Ben> examples? | |
22760 > | |
22761 > Examples? Hmm. | |
22762 > | |
22763 > First, there's the surface of Jan's y-or-n-p example. You have to | |
22764 > coordinate the translation of the message string and the response | |
22765 > prompt. This is handled by y-or-n-p itself (I see that we already do | |
22766 > have gettext for Emacs Lisp, that's nice to know). | |
22767 > | |
22768 > Except that it's not really handled by y-or-n-p. There's no reason to | |
22769 > suppose that somebody writing a Lisp package would necessarily use the | |
22770 > XEmacs domain (in fact, due to the way gettext binds text domains---if | |
22771 > I understand that correctly---we don't want that to be the case, | |
22772 > because it means that every time a Lisp package is updated the whole | |
22773 > XEmacs catalog must also be updated). So which domain gets used for | |
22774 > the message string? | |
22775 > | |
22776 > In the current implementation, it is the domain of y-or-n-p. So | |
22777 > packages with their own domain won't get y-or-n-p prompts correctly | |
22778 > translated. But that means that the package should do its own | |
22779 > translation. But now you're applying gettext to the same string | |
22780 > twice; you just have to pray that the translator upstream doesn't | |
22781 > collide with an English string that's in the XEmacs domain. (The | |
22782 > gettext docs mention the similar problem of English words with | |
22783 > multiple meanings that must map to different words in the target | |
22784 > language; this can be disambiguated by various trickeries in forming | |
22785 > the strings ... but only if you "own" them, which in the multi-domain, | |
22786 > iterated gettext example you do not.) AFAICT this means that you | |
22787 > must never pass untranslated strings across public APIs, but this may | |
22788 > or may not be reasonable, and certainly is inconvenient. | |
22789 > | |
22790 > Next, we have to translate the possible answer strings to match the | |
22791 > language being passed by the user. This is presumably OK here, | |
22792 > because it's done by y-or-n-p. But what if y-or-n-p returned a string | |
22793 > rather than a boolean? Then we would need to coordinate the | |
22794 > presentation of the prompt (done by y-or-n-p) and the translation of | |
22795 > the possible answer strings (done by the caller). This can in fact be | |
22796 > done using dgettext with the XEmacs domain, but you must know that | |
22797 > y-or-n-p is in the XEmacs domain. This is not necessarily going to be | |
22798 > obvious, and it might very well be that sets of related packages might | |
22799 > have the same domain, so you wouldn't necessarily know which domain is | |
22800 > appropriate by looking at the requires. | |
22801 > | |
22802 > And what happens if one domain does supply translations for a language | |
22803 > and the other does not? AFAIK, gettext has no way to find out if this | |
22804 > is the case. But you might very well prefer a global fallback to | |
22805 > English if substantial phrases are drawn from both domains, while you | |
22806 > might prefer string-by-string fallback if the main text is translated | |
22807 > and only a few words are left to fallback to English. | |
22808 > | |
22809 > Aside from confusing users, this puts a great burden on programmers. | |
22810 > Programmers need to know about the status of the domains of packages | |
22811 > they use as well as the XEmacs domain; they need to program | |
22812 > defensively against the possibility that some package they use will | |
22813 > become gettext-ized, or the translation projects will be out of synch | |
22814 > (some teams will do the called package first, others will do the | |
22815 > caller package first). | |
22816 > | |
22817 > I don't think anybody will use gettext in these circumstances. At | |
22818 > least not after they get the first bug report that "XEmacs is stuck in | |
22819 > an infinite y-or-n-p loop and I can't get out." | |
22820 > | |
22821 > Ben> I wrote this awhile ago: | |
22822 > | |
22823 > "There you go again." Not anywhere I could see it! (At least, it | |
22824 > doesn't look familiar and grepping the archives doesn't turn it up.) | |
22825 > | |
22826 > OK, you win. Subscribe me to xemacs-review. Or whatever seems | |
22827 > appropriate. | |
22828 > | |
22829 > -- | |
22830 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN | |
22831 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091 | |
22832 > _________________ _________________ _________________ _________________ | |
22833 > What are those straight lines for? "XEmacs rules." | |
22834 | |
22835 -- | |
22836 In order to save my hands, I am cutting back on my responses, especially | |
22837 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
22838 If you need an immediate response and it is not apparent in your message, | |
22839 please say so. Thanks for your understanding. | |
22840 | |
22841 | |
22842 | |
22843 -------------------------------------------------------------------- | |
22844 | |
22845 | |
22846 From: | |
22847 Ben Wing <ben@@666.com> | |
22848 12/21/1999 2:22 AM | |
22849 | |
22850 Subject: | |
22851 Re: check process state before accessing coding_stream (fix PR#1061) | |
22852 To: | |
22853 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp> | |
22854 CC: | |
22855 XEmacs Developers <xemacs-beta@@xemacs.org> | |
22856 | |
22857 | |
22858 | |
22859 | |
22860 | |
22861 "Stephen J. Turnbull" wrote: | |
22862 | |
22863 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes: | |
22864 > | |
22865 > Ben> Implementing message translation is not that hard. | |
22866 > | |
22867 > What I have in mind is not just gettext-izing everything in the XEmacs | |
22868 > core sources. I currently believe that to be unacceptable (see Jan's | |
22869 > message for the pitfalls in I18N; it's worse for M17N). I think | |
22870 > really solving this problem needs a specifier-like fallback mechanism | |
22871 > (this would solve Jan's example because you could query the | |
22872 > text-specifier presenting the question for the affirmative and | |
22873 > negative responses, and the catalog-building mechanism would have | |
22874 > checks to make sure they were properly set, perhaps a locale | |
22875 > (language) argument), and gettext is just not sufficient for that. | |
22876 | |
22877 I don't quite understand. Could you elaborate and give some examples? | |
22878 | |
22879 > | |
22880 > | |
22881 > At a minimum, we need to implement gettext for Lisp packages. | |
22882 > (Currently, gettext is only implemented for C AFAIK.) But this could | |
22883 > potentially cause more trouble than it's worth. | |
22884 > | |
22885 > Ben> A lot depends on priority: How important do you think this | |
22886 > Ben> issue is to your average Japanese/Chinese/etc. user? | |
22887 > | |
22888 > Which average Japanese (etc) user? The English-skilled (relatively) | |
22889 > programmer in the free software movement, or my not-at-all-competent | |
22890 > undergrad students who I would love to have using an Emacs? This is a | |
22891 > really important ease-of-use issue. | |
22892 > | |
22893 > Realistically, for Japanese, it's low priority. The Japanese team in | |
22894 > the GNU Translation Project is doing very little AFAIK, so even if the | |
22895 > capability were there, I doubt the message catalog would soon be done. | |
22896 > | |
22897 > But I think that many non-English speakers would find it very | |
22898 > attractive, and for many languages there are well-organized and | |
22899 > productive translation teams. I suspect that if the I18N facility | |
22900 > were well-designed, many Western European languages would have full | |
22901 > catalogs within a year (granted, they are the ones where it's least | |
22902 > needed :-( ). | |
22903 > | |
22904 > Personally, I think doing it well is hard, and of little benefit to | |
22905 > _current_ core XEmacs constituency. I think doing a good job, with | |
22906 > catalogs, would be very attractive to many non-English-speaking | |
22907 > _potential_ users. | |
22908 > | |
22909 > Ben> How does it compare to some of the other important Mule | |
22910 > Ben> issues that Martin and I are (trying to work) on? | |
22911 > | |
22912 > I don't know what you guys are _trying_ to work on. Everything in the | |
22913 > I18N section of "Architecting XEmacs" is red-flagged. OTOH, it's | |
22914 > clear from your posts that you are overburdened, so I can't read | |
22915 > priority into the fact that you've responded to specific issues in the | |
22916 > past. | |
22917 | |
22918 I wrote this awhile ago: | |
22919 | |
22920 | |
22921 > | |
22922 > Ben> The big question is, would you be willing to help do the | |
22923 > Ben> actual implementation, to "be my hands"? | |
22924 > | |
22925 > Sure, subject to the usual caveat that I'd need to be convinced it's | |
22926 > worth doing and a secondary caveat that I am not an experienced coder. | |
22927 | |
22928 If you'll implement it, I'll design it. It's more a case of will on your part | |
22929 than anything else. I can give you instructions sufficient enough to match | |
22930 your level of expertise. | |
22931 | |
22932 ben | |
22933 | |
22934 > | |
22935 > | |
22936 > -- | |
22937 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN | |
22938 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091 | |
22939 > _________________ _________________ _________________ _________________ | |
22940 > What are those straight lines for? "XEmacs rules." | |
22941 | |
22942 -- | |
22943 In order to save my hands, I am cutting back on my responses, especially | |
22944 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
22945 If you need an immediate response and it is not apparent in your message, | |
22946 please say so. Thanks for your understanding. | |
22947 | |
22948 | |
22949 | |
22950 ----------------------------------------------------------------------------- | |
22951 | |
22952 Dec 20, 1999 | |
22953 | |
22954 | |
22955 Implementing message translation is not that hard. I've already done a lot of | |
22956 preliminary work in places such as @file{make-msgfile.lex} in lib-src/. Finishing up | |
22957 the work is not that big a task; I already know exactly how it should be | |
22958 done. Perhaps I'll write up detailed design instructions for this, as I'm | |
22959 doing for other things. A lot depends on priority: How important do you think | |
22960 this issue is to your average Japanese/Chinese/etc. user? How does it compare | |
22961 to some of the other important Mule issues that Martin and I are (trying to | |
22962 work) on? If I did the design document, would you be willing to do the | |
22963 necessary bit of C hackery to implement the document? If the design document | |
22964 is not specific enough for you, I can give you an "implementation document" | |
22965 which will definitely be specific enough: i.e. I'll show you exactly where the | |
22966 code needs to be modified, and how. The big question is, would you be willing | |
22967 to help do the actual implementation, to "be my hands"? | |
22968 | |
22969 --------------------------------------------------------------------------- | |
22970 | |
22971 From: | |
22972 Ben Wing <ben@@666.com> | |
22973 12/14/1999 11:00 PM | |
22974 | |
22975 Subject: | |
22976 Re: Mule UI disaster: displaying character tables | |
22977 To: | |
22978 Hrvoje Niksic <hniksic@@iskon.hr> | |
22979 CC: | |
22980 XEmacs vs Mule <xemacs-mule@@xemacs.org> | |
22981 | |
22982 | |
22983 | |
22984 | |
22985 What I mean is, please put my name in the header, as well as xemacs-mule. | |
22986 That way I'll see it in my personal box. | |
22987 | |
22988 I agree that Mule has problems, but: | |
22989 | |
22990 Brokenness can be fixed. | |
22991 Slowness can be fixed. | |
22992 Limitations can be fixed. | |
22993 | |
22994 The design limitation you mention below, for example, is not really very | |
22995 hard to change. | |
22996 | |
22997 Keep in mind that I pretty much rewrote Mule from scratch, and did it | |
22998 @strong{all} in 6-7 months. In comparison with that, the changes below are | |
22999 pretty minor, and each could be done by a good (and able-bodied!) | |
23000 programmer familiar with the Mule code in less than a week -- to the | |
23001 XEmacs code, at least. The problem is, everyone who could do this work is | |
23002 spending their time complaining about Mule problems instead of | |
23003 doing things. | |
23004 | |
23005 I'll gladly help out anyone who wants to do Mule coding by explaining all | |
23006 the details; I'll even write a "Mule internals manual", if that will | |
23007 help. I can also make international phone calls -- they're cheap here in | |
23008 the US due to the long distance wars. But so far no one has asked me for | |
23009 help or shown any willingness to do any work on Mule. | |
23010 | |
23011 Perhaps people are daunted by the seeming vastness of the problems. But I | |
23012 wager that if I had another 6 months to work on nothing but Mule, it would | |
23013 be nearly perfect. The basic design of the XEmacs C code is good; | |
23014 incremental changes, without over-much concern for compatibility, could | |
23015 make huge strides in a short amount of time (as was the case the whole | |
23016 time I worked on it, esp. towards the end -- it didn't even @strong{compile} for | |
23017 4 months!). A "total rewrite" would be an incredible waste of time. | |
23018 | |
23019 Again, I'm completely willing to provide help, documentation, design | |
23020 improvement suggestions (ala Architecting XEmacs -- which seems to have | |
23021 been completely ignored, alas), etc. | |
23022 | |
23023 ben | |
23024 | |
23025 Hrvoje Niksic wrote: | |
23026 | |
23027 > Ben Wing <ben@@666.com> writes: | |
23028 > | |
23029 > > I'm the one who did most of the Mule work in XEmacs, so if you have | |
23030 > > any questions about the core, please address them to me directly. I | |
23031 > > can probably give you a very clear and detailed answer. | |
23032 > | |
23033 > Thanks. I think it still makes sense to ask here, so that other | |
23034 > developers have a chance to chime in. | |
23035 > | |
23036 > > However, I need some explanation. What's misdesigned that you're | |
23037 > > complaining about? And what's the coding-system disaster? | |
23038 > | |
23039 > It's been spoken of a lot. Basically: | |
23040 > | |
23041 > * Unlike XEmacs/no-Mule, XEmacs/Mule doesn't preserve binary files in | |
23042 > Latin 2 locales by default. This is annoying for users who are used | |
23043 > to XEmacs/no-Mule. | |
23044 > | |
23045 > * XEmacs/Mule is much slower than XEmacs, and not only because of | |
23046 > character/byte conversions. It seems that font lookups etc. are | |
23047 > slower. | |
23048 > | |
23049 > * The "coding-system disaster" refers to inherent limitations of the | |
23050 > coding-system model. If I understand things correctly, | |
23051 > coding-systems convert streams of bytes to streams of Emchars. It | |
23052 > does not appear to be possible to create a "gzip" coding system for | |
23053 > handling gzipped files. Even EOL conversions look kludgish: | |
23054 > | |
23055 > iso-2022-8 | |
23056 > iso-2022-8-dos | |
23057 > iso-2022-8-mac | |
23058 > iso-2022-8-unix | |
23059 > iso-2022-8bit-ss2 | |
23060 > iso-2022-8bit-ss2-dos | |
23061 > iso-2022-8bit-ss2-mac | |
23062 > iso-2022-8bit-ss2-unix | |
23063 > iso-2022-int-1 | |
23064 > iso-2022-int-1-dos | |
23065 > iso-2022-int-1-mac | |
23066 > iso-2022-int-1-unix | |
23067 > | |
23068 > Ideally, it should be possible to specify a stream of | |
23069 > coding-systems, where only the last one converts to actual Emchars. | |
23070 > | |
23071 > There are more problems I don't remember right now. Many many usage | |
23072 > problems become apparent when I stand and look over the shoulder of | |
23073 > an XEmacs user who tries to use Mule. | |
23074 | |
23075 -- | |
23076 In order to save my hands, I am cutting back on my responses, especially | |
23077 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
23078 | |
23079 If you need an immediate response and it is not apparent in your message, | |
23080 please say so. Thanks for your understanding. | |
23081 | |
23082 | |
23083 | |
23084 ----------------------------------------------------------------------- | |
23085 | |
23086 | |
23087 | |
23088 | |
23089 From: | |
23090 Ben Wing <ben@@666.com> | |
23091 12/14/1999 12:20 AM | |
23092 | |
23093 Subject: | |
23094 Re: Mule UI disaster: displaying character tables | |
23095 To: | |
23096 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp> | |
23097 CC: | |
23098 XEmacs vs Mule <xemacs-mule@@xemacs.org> | |
23099 | |
23100 | |
23101 | |
23102 | |
23103 I think you should go ahead with your proposal, and assume it will get | |
23104 implemented. I don't think Martin is really suggesting that API changes not | |
23105 be allowed, but just that they proceed in a somewhat orderly fashion; and in | |
23106 any case, I imagine I have final say in cases of Mule-related conflicts. | |
23107 | |
23108 ben | |
23109 | |
23110 "Stephen J. Turnbull" wrote: | |
23111 | |
23112 > >>>>> "Hrvoje" == Hrvoje Niksic <hniksic@@iskon.hr> writes: | |
23113 > | |
23114 > Hrvoje> So next I tried the "Mule" menu. That's right, boys and | |
23115 > Hrvoje> girls, I've never looked at it before. | |
23116 > | |
23117 > For quite a while, it didn't work at all, led to crashes and other | |
23118 > warm/fuzzy things. IIRC there used to be a top level menu item | |
23119 > pointing to information about the current language environment but it | |
23120 > got removed. | |
23121 > | |
23122 > Hrvoje> Wow. Seeing shift_jis, iso-2022 variants and (above all | |
23123 > Hrvoje> things) big5 makes me really warm and fuzzy. | |
23124 > | |
23125 > We've been through this recently---you were there. We know what to do | |
23126 > about it, basically (Ben liked my proposal, and it would fix this | |
23127 > silliness as well as the binary file breakage). But given that Ben | |
23128 > and Martin seem to have different ideas about where to go with Mule | |
23129 > (Ben seemed to be supporting API and implementation revisions, Martin | |
23130 > evidently wants to keep the current Mule), working on that proposal is | |
23131 > possibly a waste of time. I've got other stuff on my plate and I'll | |
23132 > get back to it one of these days (not tomorrow but sooner than Real | |
23133 > Soon Now). | |
23134 > | |
23135 > Hrvoje> The items it presents (leading to further submenus) are: | |
23136 > | |
23137 > Hrvoje> 94 character set | |
23138 > Hrvoje> 94 x 94 character set | |
23139 > Hrvoje> 96 character set | |
23140 > | |
23141 > This _is_ bad UI, now that you point it out. But it is quite natural | |
23142 > for a coding system lawyer (as all Japanese users have to be), I never | |
23143 > noticed it before. Easy enough to fix ("raise my karma"). | |
23144 > | |
23145 > Hrvoje> But I do bear some Mule scars, so I happily select "96 | |
23146 > Hrvoje> character sets", then ISO8859-2. And I get this: | |
23147 > | |
23148 > [Table omitted] | |
23149 > | |
23150 > Hrvoje> So me wonders: what the hell is this? | |
23151 > | |
23152 > Huh? That is the standard table that you see over and over again in | |
23153 > references. I'll believe you if you say you've never seen one before, | |
23154 > but every Japanese users' manual has dozens of pages of those, using | |
23155 > exactly that format. | |
23156 > | |
23157 > The presentation in the range 00--7F is not unreasonable for Latin 2; | |
23158 > ISO-8859 is a version of ISO-2022, so the high bit should not be | |
23159 > interpreted as "+ x80" (technically speaking), it should be | |
23160 > interpreted as a character set shift. | |
23161 > | |
23162 > Of course, this doesn't make sense to anybody but a character set | |
23163 > lawyer, and so should be changed. Especially since the header refers | |
23164 > to ISO-8859-2 which everybody these days thinks of as _one, 8-bit_ | |
23165 > character set, not two 7-bit ones. | |
23166 > | |
23167 > As for the "Japanese" in the table, that's just a really stupid | |
23168 > "optimization": those happen to be line-drawing characters available | |
23169 > in JIS X 0208, to make pretty borders. Substitute "-", "+", and "|" | |
23170 > in appropriate places to make ugly but portable borders. | |
23171 > | |
23172 > Hrvoje> Mule is just broken. Warn your friends. | |
23173 > | |
23174 > Hrvoje is on the rampage again. Warn your friends ;-) | |
23175 > | |
23176 > -- | |
23177 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN | |
23178 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091 | |
23179 > _________________ _________________ _________________ _________________ | |
23180 > What are those straight lines for? "XEmacs rules." | |
23181 | |
23182 -- | |
23183 In order to save my hands, I am cutting back on my responses, especially | |
23184 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
23185 If you need an immediate response and it is not apparent in your message, | |
23186 please say so. Thanks for your understanding. | |
23187 | |
23188 | |
23189 | |
23190 --------------------------------------------------------------------------- | |
23191 | |
23192 From: | |
23193 Ben Wing <ben@@666.com> | |
23194 12/14/1999 10:28 PM | |
23195 | |
23196 Subject: | |
23197 Re: Autodetect proposal; specifier questions/suggestions | |
23198 To: | |
23199 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp> | |
23200 | |
23201 | |
23202 | |
23203 | |
23204 I've always thought the specifier API is too complicated (and too | |
23205 "write-only"), but I went back at one point well after I designed it and I | |
23206 couldn't figure out an obvious way to simplify it that still kept reasonable | |
23207 functionality. Perhaps that's what Custom did, and why it turned out bad. | |
23208 | |
23209 Inefficiency is a stupid reason not to use them. They seem efficient enough | |
23210 for redisplay. Changing them might be inefficient, but Emacs Lisp is in | |
23211 general, right? | |
23212 | |
23213 Can you propose an API or functionality change that will make them more used? | |
23214 | |
23215 | |
23216 | |
23217 "Stephen J. Turnbull" wrote: | |
23218 | |
23219 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes: | |
23220 > | |
23221 > Ben> I think you should go ahead with your proposal, and assume it | |
23222 > Ben> will get implemented. | |
23223 > | |
23224 > OK. "yas baas" ;-) | |
23225 > | |
23226 > On something totally different. I'm really bothered by the fact that | |
23227 > specifiers are so little used (eg, Custom reimplements them badly), | |
23228 > and the fact that every package seems to define its own set of faces | |
23229 > (or whatever), rather than use the specifier mechanism to inherit from | |
23230 > existing ones, or add new specifications to existing ones. API problem? | |
23231 > | |
23232 > Also, faces (maybe specifiers in general?) should have an autoload | |
23233 > mechanism, and a @file{<package>-faces.el} (or @file{<package>-specifiers.el}) | |
23234 > convention. There are a number of faces in (eg) Custom that I like to | |
23235 > use, but I have to load Custom to get them. And Custom should be able | |
23236 > to somehow see all the faces in various packages available, even when | |
23237 > they are not loaded. | |
23238 > | |
23239 > I've seen claims that specifiers aren't very efficient. | |
23240 > | |
23241 > Opinions? | |
23242 > | |
23243 > -- | |
23244 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN | |
23245 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091 | |
23246 > _________________ _________________ _________________ _________________ | |
23247 > What are those straight lines for? "XEmacs rules." | |
23248 | |
23249 -- | |
23250 In order to save my hands, I am cutting back on my responses, especially | |
23251 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
23252 If you need an immediate response and it is not apparent in your message, | |
23253 please say so. Thanks for your understanding. | |
23254 | |
23255 | |
23256 ----------------------------------------------------------------------------- | |
23257 From: | |
23258 Ben Wing <ben@@666.com> | |
23259 11/18/1999 9:02 PM | |
23260 | |
23261 Subject: | |
23262 Re: Char-related crashes (hopefully) fixed | |
23263 To: | |
23264 "Stephen J. Turnbull" <turnbull@@sk.tsukuba.ac.jp> | |
23265 CC: | |
23266 XEmacs Beta List <xemacs-beta@@xemacs.org> | |
23267 | |
23268 | |
23269 | |
23270 | |
23271 OK, in summation: | |
23272 | |
23273 1. C-q is a user-level function and should do whatever makes the most sense. | |
23274 2. int-char is a low-level primitive and should never depend on high-level | |
23275 settings like language environment. | |
23276 3. Everything you can do with int-char can and should be done with make-char | |
23277 -- representation-independent, much less likelihood of bugs, etc. Therefore | |
23278 int-char should be removed. | |
23279 4. Note that CLTL2 also removes int-char. | |
23280 5. Your statement | |
23281 | |
23282 > In one-byte buffers (either Olivier's 1/2/4 extension or `xemacs -font | |
23283 > *-iso8859-2') it implicitly will have dependence whatever you say. | |
23284 | |
23285 is confusing internal and external representations. | |
23286 | |
23287 ben | |
23288 | |
23289 "Stephen J. Turnbull" wrote: | |
23290 | |
23291 > Can somebody give a bunch of examples where using integers as | |
23292 > characters is useful? For that matter, where they are actually used? | |
23293 > Ben said "backward compatibility," but I haven't seen this used, and I | |
23294 > don't really know how to grep for it. I have grepped for int-char, | |
23295 > int-to-char, char-int, and char-to-int and they're pretty rare in the | |
23296 > core and package code (2/3 of it) that I have. | |
23297 > | |
23298 > The only one that I ever use is the C-q hack for inserting characters | |
23299 > by code value at the keyboard, and that could arguably (and in | |
23300 > Japanese invariably is) delegated to an input method which would know | |
23301 > about language environment (and return a true character). | |
23302 > | |
23303 > For iterating over a character set in "natural" order, only ASCII | |
23304 > satisfies the requirement of having one, and even that's shaky. AFAIK | |
23305 > the Swedes and the Norwegians, or is it the Danes, disagree on | |
23306 > ordering the _letters_ in ISO-8859-1 character set. This really | |
23307 > should be table-driven, and will have to be for everything except | |
23308 > ASCII and ISO-8859-1 if we go to a Unicode internal representation. | |
23309 > | |
23310 > We already have primitives for efficient case conversion and the like. | |
23311 > | |
23312 > The only example I can think of offhand where you would really really | |
23313 > want the facility is to iterate over a code space where you don't know | |
23314 > which points are legal characters. Eg, to print out tables of fonts. | |
23315 > Pretty specialized. And this can be done through make-char, anyway. | |
23316 > | |
23317 > According to CLtL1, the main portable use for char-int is for hashing. | |
23318 > But that doesn't square with the kind of usage we've been talking | |
23319 > about (in loops and the like). | |
23320 > | |
23321 > What else am I missing? | |
23322 > | |
23323 > Ben's desiderata have some problems. | |
23324 > | |
23325 > >>>>> "Ben" == Ben Wing <ben@@666.com> writes: | |
23326 > | |
23327 > Ben> Either int-char should be the mirror opposite of char-int | |
23328 > Ben> (i.e. accept all legal char integers), or it should be | |
23329 > Ben> removed entirely. | |
23330 > | |
23331 > OK. I agree with this. | |
23332 > | |
23333 > Ben> int-char should @strong{never} have any dependence on the language | |
23334 > Ben> environment. | |
23335 > | |
23336 > In one-byte buffers (either Olivier's 1/2/4 extension or `xemacs -font | |
23337 > *-iso8859-2') it implicitly will have dependence whatever you say. | |
23338 > Even without Mule, people can always use external encoders to change | |
23339 > raw ISO-8859-2 to ISO-2022 (not that anybody sane ever would, OK, | |
23340 > Hrvoje?). Then the two files will be interpreted differently in a | |
23341 > Latin-1 locale Mule; the ISO-8859-2 file will be recognized as | |
23342 > ISO-8859-1, and the ISO-2022 file will be internally interpreted as | |
23343 > ISO-8859-2. | |
23344 > | |
23345 > The point is that people normally assume that int-char should accept | |
23346 > their "natural" integer to character map. For Americans, that's | |
23347 > ASCII, for Germans, that's ISO-8859-1, for Croatians, that's | |
23348 > ISO-8859-2. And it works "correctly" in a no-mule XEmacs with `-font | |
23349 > *-iso8859-2'! Japanese usually use ku-ten or JIS, and there's a | |
23350 > "natural" map from byte-sized integer pairs to shorts, but it's full | |
23351 > of holes. So language environments don't agree on what a legal char | |
23352 > integer is, and where they do (eg, ISO-8859-1 and ISO-8859-2), they | |
23353 > don't agree on the map. To satisfy your dictum (with which I agree, | |
23354 > but I take to mean we should get rid of these functions) we can take | |
23355 > the intersection where they agree | |
23356 > | |
23357 > ==> legal char integers == ASCII | |
23358 > | |
23359 > which is what I prefer, or pick something arbitrary and efficient | |
23360 > | |
23361 > ==> char-int returns the internal representation | |
23362 > | |
23363 > which I really hate, or something else. Suggestions? | |
23364 > | |
23365 > Ben> I don't think C-q should either. If Hrvoje wants to insert | |
23366 > Ben> Latin-2 characters by number, then make C-u C-q work so that | |
23367 > Ben> it also prompts for a character set, with a default chosen | |
23368 > Ben> from the language environment. | |
23369 > | |
23370 > And restrict this to ASCII? Or assume Latin-1 in GR if there is no | |
23371 > prefix argument? | |
23372 > | |
23373 > This is a useful feature. C-q currently inserts Latin-2 characters | |
23374 > for Hrvoje in no-mule XEmacs (stretching the point only a little); I | |
23375 > think it should continue to do so in Mule. This really is an input | |
23376 > method issue, not a keyboard issue. In XEmacs, inserting an integer | |
23377 > into a buffer has no meaning. Users insert characters. So this is a | |
23378 > completely different issue from the programming API, and should not be | |
23379 > considered analogous. | |
23380 > | |
23381 > Maybe we could have C-q insert according to the Unicode standard, and | |
23382 > treat C-u C-q as part of the input method. But I think most users | |
23383 > would prefer to have C-q insert according to their locale-standard | |
23384 > tables, and select Unicode explicitly using the C-u C-q idiom. In | |
23385 > fact (again this points to the input method idea), Japanese users | |
23386 > would probably like to have the alternatives of using kuten (pairs | |
23387 > from 1--94 x 1--94) or JIS (pairs from 0x21--0x7E x 0x21--0x7E) as | |
23388 > options since both indexing systems are common in tables. | |
23389 > | |
23390 > -- | |
23391 > University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN | |
23392 > Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091 | |
23393 > __________________________________________________________________________ | |
23394 > __________________________________________________________________________ | |
23395 > What are those two straight lines for? "Free software rules." | |
23396 | |
23397 -- | |
23398 ben | |
23399 | |
23400 -- | |
23401 In order to save my hands, I am cutting back on my responses, especially | |
23402 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
23403 If you need an immediate response and it's not apparent in your message, | |
23404 please say so. Thanks for your understanding. | |
23406 | |
23407 | |
23408 | |
23409 ----------------------------------------------------------------------------- | |
23410 | |
23411 From: | |
23412 Ben Wing <ben@@666.com> | |
23413 11/16/1999 11:03 PM | |
23414 | |
23415 Subject: | |
23416 Re: Char-related crashes (hopefully) fixed | |
23417 To: | |
23418 Yoshiki Hayashi <t90553@@m.ecc.u-tokyo.ac.jp> | |
23419 CC: | |
23420 Hrvoje Niksic <hniksic@@iskon.hr>, | |
23421 XEmacs Beta List <xemacs-beta@@xemacs.org> | |
23422 | |
23423 | |
23424 | |
23425 | |
23426 Either int-char should be the mirror opposite of char-int (i.e. accept all | |
23427 legal char integers), or it should be removed entirely. | |
23428 | |
23429 int-char should @strong{never} have any dependence on the language environment. | |
23430 | |
23431 I don't think C-q should either. If Hrvoje wants to insert Latin-2 | |
23432 characters by number, then make C-u C-q work so that it also prompts for a | |
23433 character set, with a default chosen from the language environment. | |
23434 | |
23435 ben | |
23436 | |
23437 Yoshiki Hayashi wrote: | |
23438 | |
23439 > Hrvoje Niksic <hniksic@@iskon.hr> writes: | |
23440 > | |
23441 > > As Ben said, now that we've fixed the actual bugs, we can think about | |
23442 > > changing the behaviour for int-char conversions for 21.2. | |
23443 > | |
23444 > The following have been proposed as the integers to accept | |
23445 > where characters are expected: | |
23446 > | |
23447 > 1) Don't allow anything | |
23448 > 2) Accept 0-127 | |
23449 > 3) Accept 0-256 | |
23450 > 4) Accept everything | |
23451 > | |
23452 > Other things proposed are: | |
23453 > | |
23454 > a) When doing C-q, treat 128-256 as Latin-2 in Latin 2 | |
23455 > language environment. | |
23456 > | |
23457 > So far, most of the proposal is intended to apply to every | |
23458 > int-char conversions, I'd like to make some functions to | |
23459 > accept. | |
23460 > | |
23461 > My plan is: | |
23462 > Accept only 0-256 in every place except int-to-char. | |
23463 > int-to-char accepts every valid integer. | |
23464 > Make new function which does int-to-char conversion | |
23465 > correctly according to the language environment. | |
23466 > | |
23467 > This way, most of the code which does (insert (1+ ?a)) or | |
23468 > something continues working. Now internal representation is | |
23469 > changed a little bit, so disabling > 256 characters will | |
23470 > warn those who are dealing with internal representation | |
23471 > directly, which is bad. Still, you can do | |
23472 > (let ((i 1442)) | |
23473 >   (while (< i 2000) | |
23474 >     (insert (int-to-char i)) | |
23475 >     (setq i (1+ i)))) | |
23476 > to achieve old behaviour. | |
23477 > | |
23478 > For C-q, I'm not for changing its original definition, | |
23479 > since it might confuse people who are expecting Latin-1 in | |
23480 > other language environment and typing just 1 integer doesn't | |
23481 > make sense for multibyte world. It's cleaner to make new | |
23482 > function, which does make-char according to the charset of | |
23483 > language-info-alist so that people who use that often can | |
23484 > bind it to C-q or some other keys. | |
23485 > | |
23486 > -- | |
23487 > Yoshiki Hayashi | |
23488 | |
23489 -- | |
23490 ben | |
23491 | |
23492 -- | |
23493 In order to save my hands, I am cutting back on my responses, especially | |
23494 to XEmacs-related mail. You _will_ get a response, but please be patient. | |
23495 If you need an immediate response and it's not apparent in your message, | |
23496 please say so. Thanks for your understanding. | |
23498 | |
23499 | |
23500 | |
23501 @end example | |
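
The two indexing schemes named at the end of the @code{int-char} thread
above, kuten (pairs from 1--94) and JIS (pairs from 0x21--0x7E), differ
only by a constant offset of 0x20 in each position.  The following Emacs
Lisp sketch illustrates that mapping.  The function names
@code{kuten-to-jis} and @code{jis-to-kuten} are hypothetical and are not
part of XEmacs; only the two ranges are taken from the discussion.

@example
;; Hypothetical helpers, not part of XEmacs.  Only the two ranges
;; (1..94 and #x21..#x7E) come from the thread; the rest is a sketch.
(defun kuten-to-jis (ku ten)
  "Convert kuten indices KU and TEN (each 1..94) to a JIS byte pair."
  (if (and (<= 1 ku) (<= ku 94) (<= 1 ten) (<= ten 94))
      (list (+ ku #x20) (+ ten #x20))
    (error "Kuten index out of range: %d-%d" ku ten)))

(defun jis-to-kuten (b1 b2)
  "Convert JIS bytes B1 and B2 (each #x21..#x7E) to a kuten pair."
  (if (and (<= #x21 b1) (<= b1 #x7e) (<= #x21 b2) (<= b2 #x7e))
      (list (- b1 #x20) (- b2 #x20))
    (error "JIS byte out of range: #x%x #x%x" b1 b2)))
@end example

For instance, @code{(kuten-to-jis 16 1)} returns @code{(48 33)}, that is,
the byte pair 0x30 0x21.  An input method along the lines suggested in the
thread could accept either form and normalize to one of them before
building a character; how the result would then be handed to
@code{make-char} is left open here.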
23502 | |
23503 @node Discussion -- Windows External Widget, Discussion -- Packages, Discussion -- Multilingual Issues, Future Work Discussion | |
23504 @section Discussion -- Windows External Widget | |
23505 @cindex discussion, windows external widget | |
23506 @cindex windows external widget, discussion | |
23507 | |
23508 @example | |
23509 | |
23510 Subject: | |
23511 Re: External Widget Support for Xemacs on nt | |
23512 Date: | |
23513 Sat, 08 Jul 2000 01:47:14 -0700 | |
23514 From: | |
23515 Ben Wing <ben@@666.com> | |
23516 To: | |
23517 Timothy.Fowler@@msdw.com | |
23518 CC: | |
23519 xemacs-nt@@xemacs.org | |
23520 References: | |
23521 1 | |
23522 | |
23523 | |
23524 | |
23525 | |
23526 Nothing is currently done for external widget support under XEmacs but it should | |
23527 not be too hard to do and would be a great addition to XEmacs. What you would | |
23528 probably want to do is create an XEmacs control that has an interface something | |
23529 like the built-in edit control and which communicates to an existing XEmacs | |
23530 process using DDE. (Basically you would modify XEmacs so that it registered | |
23531 itself as a DDE server accepting external widget requests, and then the external | |
23532 edit control would simply send a DDE request and the result would be a handle of | |
23533 some sort used for future communication with that particular XEmacs process.) | |
23534 | |
23535 There are two basic issues in getting the external widget to work, which are | |
23536 display and input. Although I am not completely sure, I have a feeling that it | |
23537 is possible for one process to write into the window of another process, simply | |
23538 by using that window's HWND handle. If so it should be extremely easy to get the | |
23539 output working (this is exactly the approach used under Xt). For input, you | |
23540 would probably again want to do what is done under Xt, which is that the client | |
23541 widget simply passes all of the appropriate messages to the XEmacs server | |
23542 process using whatever communication channel was set up, e.g. DDE, and the | |
23543 XEmacs server processes them normally. Very few modifications would be needed to | |
23544 the XEmacs source code and all of the necessary modifications could be done | |
23545 simply by looking for existing external widget code in XEmacs. | |
23546 | |
23547 If you are interested in continuing this, I will certainly give you any support | |
23548 you need along the way. This would be a great project to be added to XEmacs. | |
23549 | |
23550 | |
23551 | |
23552 Timothy Fowler wrote: | |
23553 | |
23554 > I am looking into external widget support for xemacs nt similar to that | |
23555 > existing in xemacs for X | |
23556 > Have any development efforts been made in this direction in the past? | |
23557 > Is there any current effort? | |
23558 > Any insight into the complexity of achieving this? | |
23559 > Any comments would be greatly appreciated | |
23560 > Thanks | |
23561 > Tim Fowler | |
23562 | |
23563 -- | |
23564 Ben | |
23565 | |
23566 In order to save my hands, I am cutting back on my mail. I also write | |
23567 as succinctly as possible -- please don't be offended. If you send me | |
23568 mail, you _will_ get a response, but please be patient, especially for | |
23569 XEmacs-related mail. If you need an immediate response and it is not | |
23570 apparent in your message, please say so. Thanks for your understanding. | |
23571 | |
23572 See also http://www.666.com/ben/chronic-pain/ | |
23573 | |
23574 | |
23575 Subject: | |
23576 RE: External Widget Support for Xemacs on nt | |
23577 Date: | |
23578 Mon, 10 Jul 2000 12:40:01 +0100 | |
23579 From: | |
23580 "Alastair J. Houghton" <ajhoughton@@lineone.net> | |
23581 To: | |
23582 "Ben Wing" <ben@@666.com>, <xemacs-nt@@xemacs.org> | |
23583 CC: | |
23584 <Timothy.Fowler@@msdw.com> | |
23585 | |
23586 | |
23587 | |
23588 | |
23589 > -----Original Message----- | |
23590 > From: owner-xemacs-nt@@xemacs.org [mailto:owner-xemacs-nt@@xemacs.org]On | |
23591 > Behalf Of Ben Wing | |
23592 > Sent: 08 July 2000 09:47 | |
23593 > To: Timothy.Fowler@@msdw.com | |
23594 > Cc: xemacs-nt@@xemacs.org | |
23595 > Subject: Re: External Widget Support for Xemacs on nt | |
23596 > | |
23597 > Nothing is currently done for external widget support under | |
23598 > XEmacs but it should | |
23599 > not be too hard to do and would be a great addition to XEmacs. | |
23600 > What you would | |
23601 > probably want to do is create an XEmacs control that has an | |
23602 > interface something | |
23603 > like the built-in edit control and which communicates to an | |
23604 > existing XEmacs | |
23605 > process using DDE. | |
23606 | |
23607 It would be @strong{much} better to use RPC or COM rather than DDE - and | |
23608 also it would provide a more useful interface to XEmacs (like the | |
23609 Microsoft rich text edit control that is used by Wordpad). It | |
23610 would probably also be easier... | |
23611 | |
23612 > If you are interested in continuing this, I will certainly give | |
23613 > you any support | |
23614 > you need along the way. This would be a great project to be added | |
23615 > to XEmacs. | |
23616 | |
23617 I agree. This would be a *really useful* thing to do... | |
23618 | |
23619 Regards, | |
23620 | |
23621 Alastair. | |
23622 | |
23623 ____________________________________________________________ | |
23624 Alastair Houghton ajhoughton@@lineone.net | |
23625 | |
23626 Subject: | |
23627 Re: External Widget Support for Xemacs on nt | |
23628 Date: | |
23629 Mon, 10 Jul 2000 22:56:06 -0700 | |
23630 From: | |
23631 Ben Wing <ben@@666.com> | |
23632 To: | |
23633 "Alastair J. Houghton" <ajhoughton@@lineone.net> | |
23634 CC: | |
23635 xemacs-nt@@xemacs.org, Timothy.Fowler@@msdw.com | |
23636 References: | |
23637 1 | |
23638 | |
23639 | |
23640 | |
23641 | |
23642 sounds good. i don't know too much about windows ipc methods, so i suggested | |
23643 dde just as an example. | |
23644 | |
23645 "Alastair J. Houghton" wrote: | |
23646 | |
23647 > > -----Original Message----- | |
23648 > > From: owner-xemacs-nt@@xemacs.org [mailto:owner-xemacs-nt@@xemacs.org]On | |
23649 > > Behalf Of Ben Wing | |
23650 > > Sent: 08 July 2000 09:47 | |
23651 > > To: Timothy.Fowler@@msdw.com | |
23652 > > Cc: xemacs-nt@@xemacs.org | |
23653 > > Subject: Re: External Widget Support for Xemacs on nt | |
23654 > > | |
23655 > > Nothing is currently done for external widget support under | |
23656 > > XEmacs but it should | |
23657 > > not be too hard to do and would be a great addition to XEmacs. | |
23658 > > What you would | |
23659 > > probably want to do is create an XEmacs control that has an | |
23660 > > interface something | |
23661 > > like the built-in edit control and which communicates to an | |
23662 > > existing XEmacs | |
23663 > > process using DDE. | |
23664 > | |
23665 > It would be @strong{much} better to use RPC or COM rather than DDE - and | |
23666 > also it would provide a more useful interface to XEmacs (like the | |
23667 > Microsoft rich text edit control that is used by Wordpad). It | |
23668 > would probably also be easier... | |
23669 > | |
23670 > > If you are interested in continuing this, I will certainly give | |
23671 > > you any support | |
23672 > > you need along the way. This would be a great project to be added | |
23673 > > to XEmacs. | |
23674 > | |
23675 > I agree. This would be a *really useful* thing to do... | |
23676 > | |
23677 > Regards, | |
23678 > | |
23679 > Alastair. | |
23680 > | |
23681 > ____________________________________________________________ | |
23682 > Alastair Houghton ajhoughton@@lineone.net | |
23683 | |
23684 -- | |
23685 Ben | |
23686 | |
23687 In order to save my hands, I am cutting back on my mail. I also write | |
23688 as succinctly as possible -- please don't be offended. If you send me | |
23689 mail, you _will_ get a response, but please be patient, especially for | |
23690 XEmacs-related mail. If you need an immediate response and it is not | |
23691 apparent in your message, please say so. Thanks for your understanding. | |
23692 | |
23693 See also http://www.666.com/ben/chronic-pain/ | |
23694 | |
23695 @end example | |
23696 | |
23697 | |
23698 @node Discussion -- Packages, Discussion -- Distribution Layout, Discussion -- Windows External Widget, Future Work Discussion | |
23699 @section Discussion -- Packages | |
23700 @cindex discussion, packages | |
23701 @cindex packages, discussion | |
23702 | |
23703 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
23704 | |
23705 @subheading Important package-related changes | |
23706 | |
23707 This file details changes that make the package system no longer an | |
23708 unmitigated disaster. This way, at the very least, people can | |
23709 essentially ignore the package system and not get bitten horribly the | |
23710 way they currently do. | |
23711 | |
23712 @enumerate | |
23713 @item | |
23714 A single tarball containing absolutely everything and named | |
23715 xemacs-21.2.68.tar.gz. This must contain absolutely everything, | |
23716 including all of the packages, and in the proper directory | |
23717 structure, so that the paradigm for | |
23718 | |
23719 untar; configure; make; make install | |
23720 | |
23721 just works. | |
23722 | |
23723 @item | |
23724 Fix the startup slowdown when all packages are installed so that | |
23725 there is absolutely no penalty to having them all installed. This | |
23726 may be hard. | |
23727 | |
23728 @item | |
23729 All files on the ftp site should be accessible through http. | |
23730 | |
23731 @item | |
23732 Put symlinks into the distribution directory to the appropriate | |
23733 files in the package directory. | |
23734 | |
23735 @item | |
23736 Eliminate the confusing SUMO name, choosing a much more obvious | |
23737 name such as all-packages. | |
23738 | |
23739 @item | |
23740 There should be no separation of mule and non-mule packages. | |
23741 | |
23742 @item | |
23743 Having 2 packages that conflict with each other should be | |
23744 completely disallowed. | |
23745 | |
23746 @item | |
23747 Fix vc and ps-print so that there is only ONE version. | |
23748 | |
23749 @item | |
23750 Fix up all of the READMEs on the distribution site to make it | |
23751 abundantly clear what needs to be obtained, where to get it, and | |
23752 how to install it, especially with regard to packages. | |
23753 @end enumerate | |
23754 | |
23755 @node Discussion -- Distribution Layout, , Discussion -- Packages, Future Work Discussion | |
23756 @section Discussion -- Distribution Layout | |
23757 @cindex discussion, distribution layout | |
23758 @cindex distribution layout, discussion | |
23759 | |
23760 | |
23761 @example | |
23762 From: | |
23763 Ben Wing <ben@@666.com> | |
23764 10/15/1999 8:50 PM | |
23765 | |
23766 Subject: | |
23767 VOTE: Absolutely necessary changes to file naming in releases | |
23768 To: | |
23769 SL Baur <steve@@xemacs.org>, | |
23770 XEmacs Reviews <xemacs-review@@xemacs.org> | |
23771 | |
23772 | |
23773 | |
23774 | |
23775 Everybody except Steve seems to agree that we need to provide a single | |
23776 tar file containing the entire XEmacs tree whenever we release a new | |
23777 version of XEmacs (beta or not). Therefore I propose the following | |
23778 simple changes, and ask for a vote. If it is the general will of the | |
23779 developers, then Steve @strong{WILL} make these changes. This is the | |
23780 definition of cooperative development -- no one, not even the | |
23781 maintainer, can assert absolute power over anything. | |
23782 | |
23783 I propose (assuming, for example, release 21.2.20): | |
23784 | |
23785 1. xemacs-21.2.20.tar.gz -> xemacs-21.2.20-core.tar.gz | |
23786 | |
23787 2. xemacs-sumo.tar.gz -> xemacs-packages.tar.gz | |
23788 | |
23789 3. xemacs-mule-sumo.tar.gz -> xemacs-mule-packages.tar.gz | |
23790 | |
23791 4. Symlinks to the files mentioned in #2 and #3 get created in the SAME | |
23792 directory as xemacs-21.2.20-*.tar.gz. | |
23793 | |
23794 5. MOST IMPORTANTLY, a new file xemacs-21.2.20.tar.gz gets created, | |
23795 which is the combination of the 5 files xemacs-21.2.20-core.tar.gz, | |
23796 xemacs-21.2.20-elc.tar.gz, xemacs-21.2.20-info.tar.gz, | |
23797 xemacs-packages.tar.gz, and xemacs-mule-packages.tar.gz. | |
23798 | |
23799 | |
23800 The directory structure of the new combined file xemacs-21.2.20.tar.gz | |
23801 would look like this: | |
23802 | |
23803 xemacs-21.2.20/ | |
23804 xemacs-packages/ | |
23805 xemacs-mule-packages/ | |
23806 | |
23807 | |
23808 I am sorry to shout, but the current situation is just completely | |
23809 insane. | |
23810 | |
23811 ben | |
23812 | |
23813 | |
23814 | |
23815 | |
23816 | |
23817 | |
23818 From: | |
23819 Ben Wing <ben@@666.com> | |
23820 10/16/1999 3:12 AM | |
23821 | |
23822 Subject: | |
23823 Re: VOTE: Absolutely necessary changes to file naming in releases | |
23824 To: | |
23825 SL Baur <steve@@xemacs.org>, | |
23826 XEmacs Reviews <xemacs-review@@xemacs.org>, | |
23827 "Michael Sperber [Mr. Preprocessor]" <sperber@@informatik.uni-tuebingen.de> | |
23828 | |
23829 | |
23830 | |
23831 | |
23832 Something went wrong with my mail program while I was responding, so | |
23833 Michael's response is not quoted here. | |
23834 | |
23835 Let me rephrase my proposal, stressing the important points in order of | |
23836 importance: | |
23837 | |
23838 1. MOST IMPORTANT: There MUST be a SINGLE tar file containing the complete | |
23839 XEmacs sources, packages, etc. The name of this tar file must have a | |
23840 format like this: | |
23841 | |
23842 xemacs-21.2.10.tar.gz | |
23843 | |
23844 The directory layout of the packages within it is not important as long as | |
23845 it works: The user who downloads the tar file MUST be able to apply the | |
23846 'configure; make; make install' paradigm at the top-level directory and | |
23847 have it work properly. | |
23848 | |
23849 2. All the pieces of XEmacs must be in the @strong{same} subdirectory on the FTP | |
23850 site. | |
23851 | |
23852 3. The names need to be obvious and standard. Naming the core files | |
23853 "xemacs-21.2.20.tar.gz" is non-standard because those are only the core | |
23854 files. The standard followed by everybody in the world is that a name like | |
23855 this refers to the entire product, with all ancillary files. Also, "sumo", | |
23856 although a nice in-joke, is extremely confusing and needs to go. | |
23857 | |
23858 Referring to Michael's point about the layout I proposed, I also think that | |
23859 the package system needs to be modified to accept a layout produced by the | |
23860 "obvious" way of obtaining and untarring the parts, which leaves you with a | |
23861 directory consisting of | |
23862 | |
23863 xemacs-21.2.19/ | |
23864 xemacs-packages/ | |
23865 mule-packages/ | |
23866 | |
23867 All at the same level. However, this is an independent issue from the vote | |
23868 at hand. | |
23869 | |
23870 | |
23871 Consider the current insanity. The new XEmacs user or beta tester goes to | |
23872 the FTP site, looks around, finds the file xemacs-21.2.19.tar.gz, and | |
23873 downloads it, because it looks like the obvious one to get. But it doesn't | |
23874 work. Oops ... He looks some more and finds the other two -elc and -info | |
23875 parts, grabs them, and then tries again. But it still doesn't work. He | |
23876 manages to overhear something about packages, so he looks for them, but | |
23877 doesn't find them immediately (they're not even in the beta tree, though | |
23878 they obviously contain beta-level code, especially in xemacs-base and | |
23879 mule-base). Eventually he discovers the package/ subdirectory, but what | |
23880 the hell does he do there? There's no README at all there giving any | |
23881 clues, so he downloads everything. Along with this, he gets some files | |
23882 called "sumo", which he doesn't understand, but he notices that some of | |
23883 them are extremely large. "sumo" ... "large" ... hehe, I get it. Some | |
23884 silly developer's joke. But then he tries again to compile things, and | |
23885 just can't figure things out. He still doesn't know: | |
23886 | |
23887 -- "sumo" is not just some large file, but is a tar file of all the | |
23888 packages. | |
23889 -- The packages can't be placed in any subdirectory in any obvious relation | |
23890 to the XEmacs directory ("straight out of the box" if you manage to grok | |
23891 the significance of the sumo files, you get a layout like | |
23892 | |
23893 xemacs-21.2.19/ | |
23894 xemacs-packages/ | |
23895 mule-packages/ | |
23896 | |
23897 which naturally doesn't work! He needs to put them underneath | |
23898 xemacs-21.2.19/lib/xemacs/ or something.) | |
23899 | |
23900 At this point, he gives up, and (if he was a user of a pre-packagized | |
23901 XEmacs) wonders in despair how things got so messed up, when all older | |
23902 XEmacs releases, including all the betas, followed the standard "configure; | |
23903 make; make install" paradigm. | |
23904 | |
23905 | |
23906 | |
23907 Soooooo ......... PLEASE vote on issues #1-3 above, and add any comments | |
23908 you feel like adding. | |
23909 | |
23910 ben | |
23911 | |
23912 Ben Wing wrote: | |
23913 | |
23914 > Everybody except Steve seems to agree that we need to provide a single | |
23915 > tar file containing the entire XEmacs tree whenever we release a new | |
23916 > version of XEmacs (beta or not). Therefore I propose the following | |
23917 > simple changes, and ask for a vote. If it is the general will of the | |
23918 > developers, then Steve @strong{WILL} make these changes. This is the | |
23919 > definition of cooperative development -- no one, not even the | |
23920 > maintainer, can assert absolute power over anything. | |
23921 > | |
23922 > I propose (assuming, for example, release 21.2.20): | |
23923 > | |
23924 > 1. xemacs-21.2.20.tar.gz -> xemacs-21.2.20-core.tar.gz | |
23925 > | |
23926 > 2. xemacs-sumo.tar.gz -> xemacs-packages.tar.gz | |
23927 > | |
23928 > 3. xemacs-mule-sumo.tar.gz -> xemacs-mule-packages.tar.gz | |
23929 > | |
23930 > 4. Symlinks to the files mentioned in #2 and #3 get created in the SAME | |
23931 > directory as xemacs-21.2.20-*.tar.gz. | |
23932 > | |
23933 > 5. MOST IMPORTANTLY, a new file xemacs-21.2.20.tar.gz gets created, | |
23934 > which is the combination of the 5 files xemacs-21.2.20-core.tar.gz, | |
23935 > xemacs-21.2.20-elc.tar.gz, xemacs-21.2.20-info.tar.gz, | |
23936 > xemacs-packages.tar.gz, and xemacs-mule-packages.tar.gz. | |
23937 > | |
23938 > The directory structure of the new combined file xemacs-21.2.20.tar.gz | |
23939 > would look like this: | |
23940 > | |
23941 > xemacs-21.2.20/ | |
23942 > xemacs-packages/ | |
23943 > xemacs-mule-packages/ | |
23944 > | |
23945 > I am sorry to shout, but the current situation is just completely | |
23946 > insane. | |
23947 > | |
23948 > ben | |
23949 | |
23950 | |
23951 | |
23952 From: | |
23953 Ben Wing <ben@@666.com> | |
23954 12/6/1999 4:19 AM | |
23955 | |
23956 Subject: | |
23957 Re: Please Vote on Proposals | |
23958 To: | |
23959 Kyle Jones <kyle_jones@@wonderworks.com> | |
23960 CC: | |
23961 XEmacs Review <xemacs-review@@xemacs.org> | |
23962 | |
23963 | |
23964 | |
23965 | |
23966 OK Kyle, how about a different proposal: | |
23967 | |
23968 1. The distribution consists of the following three parts (let's assume | |
23969 v21.2.25): | |
23970 | |
23971 -- xemacs-21.2.25-core.tar.gz | |
23972 The same as would currently be in xemacs-21.2.25.tar.gz. You can | |
23973 run this editor and edit in fundamental mode, but not do anything | |
23974 else. | |
23975 | |
23976 -- xemacs-21.2.25-core-packages.tar.gz | |
23977 A useful and complete subset of all the possible packages. Selection | |
23978 of | |
23979 what goes in and what goes out is based partially on consensus, | |
23980 partially | |
23981 on vote, and partially on these criteria: | |
23982 | |
23983 -- commonly-used packages go in. | |
23984 -- unmaintained or out-of-date packages go out. | |
23985 -- buggy, poorly-written packages go out. | |
23986 -- really obscure packages that hardly anybody could possibly care | |
23987 about go out. | |
23988 -- when there are two or three packages implementing basically the | |
23989 same functionality, pick only one to go in unless there are two | |
23990 that | |
23991 both are really commonly-used. | |
23992 -- if a package can be loaded implicitly as a result of something in | |
23993 the | |
23994 core, it needs to go in, regardless of whether it's been | |
23995 maintained. | |
23996 This applies, for example, to the mode files -- @strong{all} mode | |
23997 packages must | |
23998 go in (or more properly, every mode must have a corresponding | |
23999 package | |
24000 that's in, although if there are two or more packages implementing | |
24001 a | |
24002 particular mode, e.g. html, we are free to choose just one). | |
24003 | |
24004 -- xemacs-21.2.25-aux-packages.tar.gz | |
24005 All of the packages not in the previous file. Generally | |
24006 crappy-quality, | |
24007 poorly-maintained code. | |
24008 | |
24009 Note, we do not make distinctions between Mule and non-Mule in our | |
24010 packaging scheme -- this is a bug and XEmacs and/or the packages should | |
24011 be fixed up so that this goes away. | |
24012 | |
24013 2. The distribution also contains two combination files: | |
24014 | |
24015 -- xemacs-21.2.25.tar.gz | |
24016 This is the "default" file that a naive user ought to retrieve, and | |
24017 he'll get a running XEmacs, just like he wants, and comfortable, too, | |
24018 because all of the common packages are there. This file is a | |
24019 combination | |
24020 of xemacs-21.2.25-core.tar.gz and xemacs-21.2.25-core-packages.tar.gz. | |
24021 | |
24022 -- xemacs-21.2.25-everything.tar.gz | |
24023 This file contains absolutely everything, like it advertises -- | |
24024 including the aux packages and all of their associated crappy-quality, | |
24025 | |
24026 unmaintained code. This file is a combination of | |
24027 xemacs-21.2.25-core.tar.gz, | |
24028 xemacs-21.2.25-core-packages.tar.gz, and | |
24029 xemacs-21.2.25-aux-packages.tar.gz. | |
24030 | |
24031 | |
24032 I like this proposal better than the previous one I advocated, because it | |
24033 follows your good suggestion of separating the wheat from the chaff in | |
24034 the packages, so to speak. People will grab xemacs-21.2.25.tar.gz by | |
24035 default, just like they should, | |
24036 and they'll get something they're quite happy with, and we're happy | |
24037 because we can exercise quality control over the packages and exclude the | |
24038 crappy ones most likely to cause grief later on. | |
24039 | |
24040 | |
24041 What say y'all? | |
24042 | |
24043 ben | |
24044 | |
24045 | |
24046 | |
24047 Kyle Jones wrote: | |
24048 | |
24049 > Ben Wing writes: | |
24050 > > Disagree. Please let's follow everyone else's convention, and not | |
24051 > > introduce yet another randomness. | |
24052 > | |
24053 > It is not randomness! I think this is a semantic issue and an | |
24054 > important one. The issue is: What do we consider part of XEmacs | |
24055 > and what is considered external to XEmacs. If you put all the | |
24056 > packages in xemacs.tar.gz, then users can reasonably and wrongly | |
24057 > assume that all this random Lisp code is maintained by us. We | |
24058 > are trying to stay away from that model because in the past it has | |
24059 > left us with piles and piles of orphaned code. Even if every one | |
24060 > of us were paid to maintain XEmacs, it is just not practical for | |
24061 > us to continue to maintain all that code, let alone any new code. | |
24062 > So I think the naming distinction Jan is making is worth doing. | |
24063 > | |
24064 > Also, I don't consider the current situation broken, except | |
24065 > perhaps the sumo tarball being out of date. I never, ever, | |
24066 > thought it was a great idea to ship all the stuff that XEmacs | |
24067 > shipped in the old days. Because this pile of code was always | |
24068 > around in the distribution, an enormous web of undocumented | |
24069 > dependencies was constructed. Eventually, you HAD to install | |
24070 > everything because if you left something out or removed something | |
24071 > you never knew when XEmacs would throw an error. Thus the Cult | |
24072 > of the Cargo was born. | |
24073 > | |
24074 > One of the best things that came out of the package system was | |
24075 > the month or two we spent running XEmacs without all the assorted | |
24076 > Lisp installed. Dependencies were removed or documented, some | |
24077 > stuff got retired, and for the first time we actually had a full | |
24078 > accounting of what we were shipping. I currently run XEmacs with | |
24079 > 7 packages and I don't miss the other stuff. | |
24080 > | |
24081 > Having come this far, I do not think we should go back to | |
24082 > advocating that everyone just install everything and not | |
24083 > think about what they are doing. Besides saving space and startup | |
24084 > time, another reason to not install everything is that you | |
24085 > won't bloat your XEmacs process nearly as much if you go | |
24086 > exploring in the Custom menus, because there won't be as much | |
24087 > Lisp loaded as Custom sets up its groups and whatnot. | |
24088 | |
24089 -- | |
24090 In order to save my hands, I am cutting back on my responses, especially | |
24091 to XEmacs-related mail. You _will_ get a response, but please be | |
24092 patient. | |
24093 If you need an immediate response and it is not apparent in your message, | |
24094 | |
24095 please say so. Thanks for your understanding. | |
20031 @end example | 24096 @end example |
20032 | 24097 |
20033 @node Old Future Work, Index, Future Work Discussion, Top | 24098 @node Old Future Work, Index, Future Work Discussion, Top |
20034 @chapter Old Future Work | 24099 @chapter Old Future Work |
20035 @cindex old future work | 24100 @cindex old future work |
20039 implemented. These proposals are included because they may describe to | 24104 implemented. These proposals are included because they may describe to |
20040 some extent the actual workings of the implemented code, and because | 24105 some extent the actual workings of the implemented code, and because |
20041 they may discuss relevant design issues, alternative implementations, or | 24106 they may discuss relevant design issues, alternative implementations, or |
20042 work still to be done. | 24107 work still to be done. |
20043 | 24108 |
20044 | |
20045 @menu | 24109 @menu |
20046 * Future Work -- A Portable Unexec Replacement:: | 24110 * Old Future Work -- A Portable Unexec Replacement:: |
20047 * Future Work -- Indirect Buffers:: | 24111 * Old Future Work -- Indirect Buffers:: |
20048 * Future Work -- Improvements in support for non-ASCII (European) keysyms under X:: | 24112 * Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X:: |
20049 * Future Work -- xemacs.org Mailing Address Changes:: | 24113 * Old Future Work -- RTF Clipboard Support:: |
20050 * Future Work -- Lisp callbacks from critical areas of the C code:: | 24114 * Old Future Work -- xemacs.org Mailing Address Changes:: |
24115 * Old Future Work -- Lisp callbacks from critical areas of the C code:: | |
20051 @end menu | 24116 @end menu |
20052 | 24117 |
20053 @node Future Work -- A Portable Unexec Replacement, Future Work -- Indirect Buffers, Old Future Work, Old Future Work | 24118 @node Old Future Work -- A Portable Unexec Replacement, Old Future Work -- Indirect Buffers, Old Future Work, Old Future Work |
20054 @section Future Work -- A Portable Unexec Replacement | 24119 @section Old Future Work -- A Portable Unexec Replacement |
20055 @cindex future work, a portable unexec replacement | 24120 @cindex old future work, a portable unexec replacement |
20056 @cindex a portable unexec replacement, future work | 24121 @cindex a portable unexec replacement, old future work |
24122 | |
24123 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
20057 | 24124 |
20058 @strong{Abstract:} Currently, during the build stage of XEmacs, a bare | 24125 @strong{Abstract:} Currently, during the build stage of XEmacs, a bare |
20059 version of the program (called @dfn{temacs}) is run, which loads up a | 24126 version of the program (called @dfn{temacs}) is run, which loads up a |
20060 bunch of Lisp data and then writes out a modified executable file. This | 24127 bunch of Lisp data and then writes out a modified executable file. This |
20061 process is very tricky to implement and highly system-dependent. It can | 24128 process is very tricky to implement and highly system-dependent. It can |
20179 preprocessor, or by simply using a different name, such as | 24246 preprocessor, or by simply using a different name, such as |
20180 @code{xmalloc}. It's also very important that we use the correct | 24247 @code{xmalloc}. It's also very important that we use the correct |
20181 @code{free} function when freeing dynamically-allocated data, depending | 24248 @code{free} function when freeing dynamically-allocated data, depending |
20182 on whether this data was allocated by us or by the | 24249 on whether this data was allocated by us or by the |
20183 | 24250 |
20184 @node Future Work -- Indirect Buffers, Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Future Work -- A Portable Unexec Replacement, Old Future Work | 24251 @node Old Future Work -- Indirect Buffers, Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work -- A Portable Unexec Replacement, Old Future Work |
20185 @section Future Work -- Indirect Buffers | 24252 @section Old Future Work -- Indirect Buffers |
20186 @cindex future work, indirect buffers | 24253 @cindex old future work, indirect buffers |
20187 @cindex indirect buffers, future work | 24254 @cindex indirect buffers, old future work |
24255 | |
24256 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
20188 | 24257 |
20189 An indirect buffer is a buffer that shares its text with some other | 24258 An indirect buffer is a buffer that shares its text with some other |
20190 buffer, but has its own version of all of the buffer properties, | 24259 buffer, but has its own version of all of the buffer properties, |
20191 including markers, extents, buffer local variables, etc. Indirect | 24260 including markers, extents, buffer local variables, etc. Indirect |
20192 buffers are not currently implemented in XEmacs, but they are in GNU | 24261 buffers are not currently implemented in XEmacs, but they are in GNU |
20258 done only once, rather than on each buffer. I imagine it would be | 24327 done only once, rather than on each buffer. I imagine it would be |
20259 significantly easier to implement this, if a macro were created for | 24328 significantly easier to implement this, if a macro were created for |
20260 iterating over a buffer, and then all of the indirect children of that | 24329 iterating over a buffer, and then all of the indirect children of that |
20261 buffer. | 24330 buffer. |
20262 | 24331 |
20263 @node Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Future Work -- xemacs.org Mailing Address Changes, Future Work -- Indirect Buffers, Old Future Work | 24332 @node Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work -- RTF Clipboard Support, Old Future Work -- Indirect Buffers, Old Future Work |
20264 @section Future Work -- Improvements in support for non-ASCII (European) keysyms under X | 24333 @section Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X |
20265 @cindex future work, improvements in support for non-ascii (european) keysyms under x | 24334 @cindex old future work, improvements in support for non-ascii (european) keysyms under x |
20266 @cindex improvements in support for non-ascii (european) keysyms under x, future work | 24335 @cindex improvements in support for non-ascii (european) keysyms under x, old future work |
20267 | 24336 |
20268 From Martin Buchholz. | 24337 Author: @uref{mailto:martin@@xemacs.org,Martin Buchholz} |
20269 | 24338 |
20270 If a user has a keyboard with known standard non-ASCII character | 24339 If a user has a keyboard with known standard non-ASCII character |
20271 equivalents, typically for European users, then Emacs' default | 24340 equivalents, typically for European users, then Emacs' default |
20272 binding should be self-insert-command, with the obvious character | 24341 binding should be self-insert-command, with the obvious character |
20273 inserted. For example, if a user has a keyboard with | 24342 inserted. For example, if a user has a keyboard with |
20282 even be bound to anything by a user trying to customize it. | 24351 even be bound to anything by a user trying to customize it. |
20283 | 24352 |
20284 This is implemented by maintaining a table of translations between all | 24353 This is implemented by maintaining a table of translations between all |
20285 the known X keysym names and the corresponding (charset, octet) pairs. | 24354 the known X keysym names and the corresponding (charset, octet) pairs. |
20286 | 24355 |
24356 @quotation | |
20287 For every key on the keyboard that has a known character correspondence, | 24357 For every key on the keyboard that has a known character correspondence, |
20288 we define the ascii-character property of the keysym, and make the | 24358 we define the ascii-character property of the keysym, and make the |
20289 default binding for the key be self-insert-command. | 24359 default binding for the key be self-insert-command. |
20290 | 24360 |
20291 The following magic is basically intimate knowledge of X11/keysymdef.h. | 24361 The following magic is basically intimate knowledge of X11/keysymdef.h. |
20293 except for Cyrillic and Greek. | 24363 except for Cyrillic and Greek. |
20294 | 24364 |
20295 In a non-Mule world, a user can still have a multi-lingual editor, by doing | 24365 In a non-Mule world, a user can still have a multi-lingual editor, by doing |
20296 (set-face-font "...-iso8859-2" (current-buffer)) | 24366 (set-face-font "...-iso8859-2" (current-buffer)) |
20297 for all their Latin-2 buffers, etc. | 24367 for all their Latin-2 buffers, etc. |
20298 | 24368 @end quotation |
20299 @node Future Work -- xemacs.org Mailing Address Changes, Future Work -- Lisp callbacks from critical areas of the C code, Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work | 24369 |
20300 @section Future Work -- xemacs.org Mailing Address Changes | 24370 @node Old Future Work -- RTF Clipboard Support, Old Future Work -- xemacs.org Mailing Address Changes, Old Future Work -- Improvements in support for non-ASCII (European) keysyms under X, Old Future Work |
20301 @cindex future work, xemacs.org mailing address changes | 24371 @section Old Future Work -- RTF Clipboard Support |
20302 @cindex xemacs.org mailing address changes, future work | 24372 @cindex old future work, RTF clipboard support |
24373 @cindex RTF clipboard support, old future work | |
24374 | |
24375 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
24376 | |
24377 In fact, I merged the Windows stuff with the already-existing generic code. | |
24378 | |
24379 What I'd like to see is something like this: | |
24380 | |
24381 @enumerate | |
24382 @item | |
24383 The current function | |
24384 | |
24385 @example | |
24386 (defun own-selection (data &optional type append) | |
24387 @end example | |
24388 | |
24389 should become | |
24390 | |
24391 @example | |
24392 (defun own-selection (data &optional type how-to-add data-type) | |
24393 @end example | |
24394 | |
24395 where data-type is the mswindows format, and how-to-add is | |
24396 | |
24397 @example | |
24398 'replace-all or nil -- remove data for all formats | |
24399 'replace-existing -- remove data for DATA-TYPE, but leave other formats alone | |
24400 'append or t -- append data to existing data in DATA-TYPE, and leave other | |
24401 formats alone | |
24402 @end example | |
24403 | |
24404 @item | |
24405 the function | |
24406 | |
24407 @example | |
24408 (get-selection &optional TYPE DATA-TYPE) | |
24409 @end example | |
24410 | |
24411 already has a data-type so you don't need to change it. | |
24412 | |
24413 @item | |
24414 the existing function | |
24415 | |
24416 @example | |
24417 (selection-exists-p &optional SELECTION DEVICE) | |
24418 @end example | |
24419 | |
24420 should become | |
24421 | |
24422 @example | |
24423 (selection-exists-p &optional SELECTION DEVICE DATA-TYPE) | |
24424 @end example | |
24425 | |
24426 @item | |
24427 a new function | |
24428 | |
24429 @example | |
24430 (register-selection-data-type DATA-TYPE) | |
24431 @end example | |
24432 | |
24433 like your mswindows-register-clipboard-format. | |
24434 | |
24435 @item | |
24436 There's already a selection-converter-alist, but that's only for data out. | |
24437 You should alias it to selection-conversion-out-alist, and create | |
24438 selection-conversion-in-alist. These alists contain entries for CF_TEXT, which | |
24439 handles CR/LF conversion, and rtf, which does rtf in/out conversion -- no need | |
24440 for separate functions to do this. | |
24441 | |
24442 This may seem daunting, but it's much easier to add stuff like this than it | |
24443 seems, and I and others will certainly give you lots of support if you run into | |
24444 problems. It would be way cool to have a more powerful clipboard mechanism in | |
24445 XEmacs. | |
24446 @end enumerate | |
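
The three HOW-TO-ADD modes proposed above can be illustrated with a small
toy model in C. This is only a sketch under stated assumptions, not XEmacs
code: the names @code{toy_own_selection}, @code{toy_get_selection},
@code{struct selection} and the fixed-size buffers are all hypothetical,
and a real implementation would store Lisp objects rather than C strings.
It is meant only to show which formats survive each mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FORMATS 8    /* hypothetical limit on formats per selection */
#define MAX_NAME    32   /* hypothetical limit on a format name */
#define MAX_DATA    256  /* hypothetical limit on data per format */

/* Stand-ins for the proposed 'replace-all, 'replace-existing and 'append. */
enum how_to_add { REPLACE_ALL, REPLACE_EXISTING, APPEND };

struct format_slot
{
  char data_type[MAX_NAME];  /* e.g. "CF_TEXT" or "rtf"; "" means unused */
  char data[MAX_DATA];
};

struct selection
{
  struct format_slot slots[MAX_FORMATS];
};

/* Find the slot for DATA_TYPE; if CREATE is nonzero and none exists,
   claim the first unused slot.  Returns NULL when nothing is available. */
static struct format_slot *
find_slot (struct selection *sel, const char *data_type, int create)
{
  struct format_slot *unused = NULL;
  int i;

  for (i = 0; i < MAX_FORMATS; i++)
    {
      if (strcmp (sel->slots[i].data_type, data_type) == 0)
        return &sel->slots[i];
      if (unused == NULL && sel->slots[i].data_type[0] == '\0')
        unused = &sel->slots[i];
    }
  if (create && unused != NULL)
    {
      strcpy (unused->data_type, data_type);  /* length checked by caller */
      unused->data[0] = '\0';
    }
  return create ? unused : NULL;
}

/* Toy version of the proposed own-selection.
   Returns 0 on success, -1 if the arguments do not fit the toy limits. */
static int
toy_own_selection (struct selection *sel, const char *data,
                   enum how_to_add how, const char *data_type)
{
  struct format_slot *slot;
  size_t used;

  assert (sel != NULL && data != NULL && data_type != NULL);
  if (data_type[0] == '\0' || strlen (data_type) >= MAX_NAME)
    return -1;
  if (strlen (data) >= MAX_DATA)
    return -1;

  if (how == REPLACE_ALL)
    memset (sel, 0, sizeof *sel);   /* remove data for all formats */

  slot = find_slot (sel, data_type, 1);
  if (slot == NULL)
    return -1;

  /* REPLACE_ALL and REPLACE_EXISTING overwrite this format's data;
     APPEND keeps it.  Other formats are untouched at this point. */
  used = (how == APPEND) ? strlen (slot->data) : 0;
  if (used + strlen (data) >= MAX_DATA)
    return -1;
  strcpy (slot->data + used, data);
  return 0;
}

/* Toy version of get-selection: the data for DATA_TYPE, or NULL. */
static const char *
toy_get_selection (struct selection *sel, const char *data_type)
{
  struct format_slot *slot;

  if (data_type[0] == '\0')
    return NULL;
  slot = find_slot (sel, data_type, 0);
  return slot != NULL ? slot->data : NULL;
}
```

In this model, owning the selection as CF_TEXT with @code{REPLACE_ALL} and
then adding an "rtf" rendering with @code{REPLACE_EXISTING} leaves both
formats available, while a later @code{REPLACE_ALL} discards the "rtf" data
along with everything else.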
24447 | |
24448 @node Old Future Work -- xemacs.org Mailing Address Changes, Old Future Work -- Lisp callbacks from critical areas of the C code, Old Future Work -- RTF Clipboard Support, Old Future Work | |
24449 @section Old Future Work -- xemacs.org Mailing Address Changes | |
24450 @cindex old future work, xemacs.org mailing address changes | |
24451 @cindex xemacs.org mailing address changes, old future work | |
24452 | |
24453 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} | |
20303 | 24454 |
20304 @subheading Personal addresses | 24455 @subheading Personal addresses |
20305 | 24456 |
20306 @enumerate | 24457 @enumerate |
20307 @item | 24458 @item |
20380 addresses set up will make it much easier for this momentum to be built | 24531 addresses set up will make it much easier for this momentum to be built |
20381 up and to remain. | 24532 up and to remain. |
20382 | 24533 |
20383 @uref{../../www.666.com/ben/default.htm,Ben Wing} | 24534 @uref{../../www.666.com/ben/default.htm,Ben Wing} |
20384 | 24535 |
20385 @node Future Work -- Lisp callbacks from critical areas of the C code, , Future Work -- xemacs.org Mailing Address Changes, Old Future Work | 24536 @node Old Future Work -- Lisp callbacks from critical areas of the C code, , Old Future Work -- xemacs.org Mailing Address Changes, Old Future Work |
20386 @section Future Work -- Lisp callbacks from critical areas of the C code | 24537 @section Old Future Work -- Lisp callbacks from critical areas of the C code |
20387 @cindex future work, lisp callbacks from critical areas of the c code | 24538 @cindex old future work, lisp callbacks from critical areas of the c code |
20388 @cindex lisp callbacks from critical areas of the c code, future work | 24539 @cindex lisp callbacks from critical areas of the c code, old future work |
20389 | 24540 |
20390 @example | 24541 Author: @uref{mailto:ben@@xemacs.org,Ben Wing} |
24542 | |
20391 There are many places in the XEmacs C code where Lisp functions are | 24543 There are many places in the XEmacs C code where Lisp functions are |
20392 called, usually because the Lisp function is acting as a callback, | 24544 called, usually because the Lisp function is acting as a callback, |
20393 hook, process filter, or the like. The lisp code is often called in | 24545 hook, process filter, or the like. The lisp code is often called in |
20394 places where some lisp operations are dangerous. Currently there are | 24546 places where some lisp operations are dangerous. Currently there are |
20395 a lot of ad-hoc schemes implemented to try to prevent these dangerous | 24547 a lot of ad-hoc schemes implemented to try to prevent these dangerous |
20421 | 24573 |
20422 Corresponding to each of these entries is the C name of the bit flag. | 24574 Corresponding to each of these entries is the C name of the bit flag. |
20423 | 24575 |
20424 The sets of dangerous operations which can be prohibited are: | 24576 The sets of dangerous operations which can be prohibited are: |
20425 | 24577 |
20426 OPERATION_GC_PROHIBITED | 24578 @table @code |
20427 1. garbage collection. When this flag is set, and the garbage | 24579 @item OPERATION_GC_PROHIBITED |
20428 collection threshold is reached, garbage collection simply doesn't | 24580 garbage collection. When this flag is set, and the garbage |
20429 happen. It will happen at the next opportunity that it is allowed. | 24581 collection threshold is reached, garbage collection simply doesn't |
20430 Similarly, explicitly calling the Lisp function garbage-collect | 24582 happen. It will happen at the next opportunity that it is allowed. |
20431 simply does nothing. | 24583 Similarly, explicitly calling the Lisp function garbage-collect |
20432 | 24584 simply does nothing. |
20433 OPERATION_CATCH_ERRORS | 24585 |
20434 2. signalling an error. When @code{enter_sensitive_code_section()} is | 24586 @item OPERATION_CATCH_ERRORS |
20435 called, with the bit flag corresponding to this prohibited | 24587 signalling an error. When @code{enter_sensitive_code_section()} is |
20436 operation. When this bit flag is passed to | 24588 called, with the bit flag corresponding to this prohibited |
20437 @code{enter_sensitive_code_section()}, a catch is set up which catches all | 24589 operation. When this bit flag is passed to |
20438 errors, signals a warning with @code{warn_when_safe()}, and then simply | 24590 @code{enter_sensitive_code_section()}, a catch is set up which catches all |
20439 continues. This is exactly the same behavior you now get with the | 24591 errors, signals a warning with @code{warn_when_safe()}, and then simply |
20440 @code{call_*_trapping_errors()} functions. (there should also be some way | 24592 continues. This is exactly the same behavior you now get with the |
20441 of specifying a warning level and class here, similar to the | 24593 @code{call_*_trapping_errors()} functions. (there should also be some way |
20442 @code{call_*_trapping_errors()} functions. This is not completely | 24594 of specifying a warning level and class here, similar to the |
20443 important, however, because a standard warning level and class | 24595 @code{call_*_trapping_errors()} functions. This is not completely |
20444 could simply be chosen.) | 24596 important, however, because a standard warning level and class |
20445 | 24597 could simply be chosen.) |
20446 OPERATION_NO_UNSAFE_OBJECT_DELETION | 24598 |
20447 3. This flag prohibits deletion of any permanent object (i.e. any | 24599 @item OPERATION_NO_UNSAFE_OBJECT_DELETION |
20448 object that does not automatically disappear when created, such as | 24600 This flag prohibits deletion of any permanent object (i.e. any |
20449 buffers, frames, devices, windows, etc...) unless they were created | 24601 object that does not automatically disappear when created, such as |
20450 after this bit flag was set. This would be implemented using a | 24602 buffers, frames, devices, windows, etc...) unless they were created |
20451 list which stores all of the permanent objects created after this | 24603 after this bit flag was set. This would be implemented using a |
20452 bit flag was set. This list is reset to its previous value when | 24604 list which stores all of the permanent objects created after this |
20453 the call to @code{exit_sensitive_code_section()} occurs. The motivation | 24605 bit flag was set. This list is reset to its previous value when |
20454 here is to allow Lisp callbacks to create their own temporary | 24606 the call to @code{exit_sensitive_code_section()} occurs. The motivation |
20455 buffers or frames, and later delete them, but not allow any other | 24607 here is to allow Lisp callbacks to create their own temporary |
20456 permanent objects to be deleted, because C code might be working | 24608 buffers or frames, and later delete them, but not allow any other |
20457 with them, and not expect them to change. | 24609 permanent objects to be deleted, because C code might be working |
20458 | 24610 with them, and not expect them to change. |
20459 OPERATION_NO_BUFFER_MODIFICATION | 24611 |
20460 4. This flag disallows modifications to the text, extent or any other | 24612 @item OPERATION_NO_BUFFER_MODIFICATION |
20461 properties of any buffers except those created after this flag was | 24613 This flag disallows modifications to the text, extent or any other |
20462 set, just like in the previous entry. | 24614 properties of any buffers except those created after this flag was |
20463 | 24615 set, just like in the previous entry. |
20464 OPERATION_NO_REDISPLAY | 24616 |
20465 5. This bit flag inhibits any redisplay-related operations from | 24617 @item OPERATION_NO_REDISPLAY |
20466 happening, more specifically, any entry into the redisplay-related | 24618 This bit flag inhibits any redisplay-related operations from |
20467 code. This includes, for example, the Lisp functions sit-for, | 24619 happening, more specifically, any entry into the redisplay-related |
20468 force-redisplay, force-cursor-redisplay, window-end with certain | 24620 code. This includes, for example, the Lisp functions sit-for, |
20469 arguments to it, and various other functions. When this flag is | 24621 force-redisplay, force-cursor-redisplay, window-end with certain |
20470 set, instead of entering the redisplay code, the calling function | 24622 arguments to it, and various other functions. When this flag is |
20471 should simply make sure not to enter the redisplay code, (for | 24623 set, instead of entering the redisplay code, the calling function |
20472 example, in the case of window-end), or postpone the redisplay | 24624 should simply make sure not to enter the redisplay code, (for |
20473 until such a time when it's safe (for example, with sit-for and | 24625 example, in the case of window-end), or postpone the redisplay |
20474 force-redisplay). | 24626 until such a time when it's safe (for example, with sit-for and |
20475 | 24627 force-redisplay). |
20476 OPERATION_NO_REDISPLAY_SETTINGS_CHANGE | 24628 |
20477 6. This flag prohibits any modifications to faces, glyphs, specifiers, | 24629 @item OPERATION_NO_REDISPLAY_SETTINGS_CHANGE |
20478 extents, or any other settings that will affect the way that any | 24630 This flag prohibits any modifications to faces, glyphs, specifiers, |
20479 window is displayed. | 24631 extents, or any other settings that will affect the way that any |
20480 | 24632 window is displayed. |
24633 @end table | |
20481 | 24634 |
20482 The idea here is that it will finally be safe to call Lisp code from | 24635 The idea here is that it will finally be safe to call Lisp code from |
20483 nearly any part of the C code, simply by setting any combination of | 24636 nearly any part of the C code, simply by setting any combination of |
20484 restricted operation bit flags. This even includes from within | 24637 restricted operation bit flags. This even includes from within |
20485 redisplay. (in such a case, all of the bit flags need to be set). The | 24638 redisplay. (in such a case, all of the bit flags need to be set). The |
20486 reason that I thought of this is that some coding system translations | 24639 reason that I thought of this is that some coding system translations |
20487 might cause Lisp code to be invoked and C code often invokes these | 24640 might cause Lisp code to be invoked and C code often invokes these |
20488 translations in sensitive places. | 24641 translations in sensitive places. |
20489 @end example | |
20490 | 24642 |
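The bit-flag scheme described in this section could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from the XEmacs sources. The flag names and the names @code{enter_sensitive_code_section()} and @code{exit_sensitive_code_section()} come from the text. The numeric flag values, the function signatures (enter returns the previous mask, exit restores it), the helper @code{operation_prohibited_p()}, the @code{OPERATION_ALL} mask and the stand-in @code{maybe_garbage_collect()} are all hypothetical.

```c
#include <assert.h>

/* Flag names are taken from the text above; the bit values are assumed. */
enum sensitive_operation
{
  OPERATION_GC_PROHIBITED                = 1 << 0,
  OPERATION_CATCH_ERRORS                 = 1 << 1,
  OPERATION_NO_UNSAFE_OBJECT_DELETION    = 1 << 2,
  OPERATION_NO_BUFFER_MODIFICATION       = 1 << 3,
  OPERATION_NO_REDISPLAY                 = 1 << 4,
  OPERATION_NO_REDISPLAY_SETTINGS_CHANGE = 1 << 5,
  /* Hypothetical convenience mask: every restriction at once,
     as would be needed when calling Lisp from within redisplay. */
  OPERATION_ALL                          = (1 << 6) - 1
};

/* Set of operations currently prohibited (hypothetical global). */
static int prohibited_operations;

/* Assumed signature: add FLAGS to the prohibited set and return the
   previous set so that nested sections can restore it on exit. */
static int
enter_sensitive_code_section (int flags)
{
  int previous = prohibited_operations;
  prohibited_operations |= flags;
  return previous;
}

/* Assumed signature: restore the set saved by the matching enter call. */
static void
exit_sensitive_code_section (int previous)
{
  prohibited_operations = previous;
}

/* Hypothetical helper for code that is about to do something dangerous. */
static int
operation_prohibited_p (int flag)
{
  return (prohibited_operations & flag) != 0;
}

/* Stand-in for the garbage collector entry point: while GC is
   prohibited the request is only remembered, and the collection
   runs at the next opportunity at which it is allowed. */
static int gc_pending;
static int gc_runs;

static void
maybe_garbage_collect (void)
{
  if (operation_prohibited_p (OPERATION_GC_PROHIBITED))
    {
      gc_pending = 1;
      return;
    }
  gc_pending = 0;
  gc_runs++;
}
```

Under these assumptions, C code would bracket a Lisp callback with an enter call and a matching exit call, passing the bitwise OR of the restrictions it needs. Returning the previous mask from the enter call is one way to let sections nest; the text does not say how nesting is meant to be handled.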
20491 @c Indexing guidelines | 24643 @c Indexing guidelines |
20492 | 24644 |
20493 @c I assume that all indexes will be combined. | 24645 @c I assume that all indexes will be combined. |
20494 @c Therefore, if a generated findex and permutations | 24646 @c Therefore, if a generated findex and permutations |