comparison man/internals/internals.texi @ 5096:e0587c615e8b

Updates to internals.texi

-------------------- ChangeLog entries follow: --------------------

man/ChangeLog addition:

2010-03-04  Ben Wing  <ben@xemacs.org>

    * internals/internals.texi (Top):
    * internals/internals.texi (list-to-texinfo): Removed.
    * internals/internals.texi (convert-list-to-texinfo): New.
    * internals/internals.texi (table-to-texinfo): Removed.
    * internals/internals.texi (convert-table-to-texinfo): New.
    Update Lisp functions at top to newest versions.

    * internals/internals.texi (A History of Emacs):
    * internals/internals.texi (Through Version 18):
    * internals/internals.texi (Lucid Emacs):
    * internals/internals.texi (XEmacs):
    * internals/internals.texi (The XEmacs Split):
    * internals/internals.texi (Modules for Other Aspects of the Lisp Interpreter and Object System):
    * internals/internals.texi (Introduction to Writing C Code):
    * internals/internals.texi (Writing Good Comments):
    * internals/internals.texi (Writing Macros):
    * internals/internals.texi (Major Textual Changes):
    * internals/internals.texi (Great Integral Type Renaming):
    * internals/internals.texi (How to Regression-Test):
    * internals/internals.texi (Creating a Branch):
    * internals/internals.texi (Dynamic Arrays):
    * internals/internals.texi (Allocation by Blocks):
    * internals/internals.texi (mark_object):
    * internals/internals.texi (gc_sweep):
    * internals/internals.texi (Byte-Char Position Conversion):
    * internals/internals.texi (Searching and Matching):
    * internals/internals.texi (Introduction to Multilingual Issues #3):
    * internals/internals.texi (Byte Types):
    * internals/internals.texi (Different Ways of Seeing Internal Text):
    * internals/internals.texi (Buffer Positions):
    * internals/internals.texi (Basic internal-format APIs):
    * internals/internals.texi (The DFC API):
    * internals/internals.texi (General Guidelines for Writing Mule-Aware Code):
    * internals/internals.texi (Mule-izing Code):
    * internals/internals.texi (Locales):
    * internals/internals.texi (More about code pages):
    * internals/internals.texi (More about locales):
    * internals/internals.texi (Unicode support under Windows):
    * internals/internals.texi (The Frame):
    * internals/internals.texi (The Non-Client Area):
    * internals/internals.texi (The Client Area):
    * internals/internals.texi (The Paned Area):
    * internals/internals.texi (Text Areas):
    * internals/internals.texi (The Displayable Area):
    * internals/internals.texi (Event Queues):
    * internals/internals.texi (Event Stream Callback Routines):
    * internals/internals.texi (Focus Handling):
    * internals/internals.texi (Future Work -- Autodetection):
    Replace " with ``, '' (not complete, maybe about halfway through).
author Ben Wing <ben@xemacs.org>
date Thu, 04 Mar 2010 07:19:03 -0600
parents 0ca81354c4c7
children 7be849cb8828
5095:cb4f2e1bacc4 5096:e0587c615e8b
159 that has been formatted into ASCII lists and tables. 159 that has been formatted into ASCII lists and tables.
160 160
161 Note: to define these routines, put point after the end of the definition 161 Note: to define these routines, put point after the end of the definition
162 and type C-x C-e. 162 and type C-x C-e.
163 163
164 (defun list-to-texinfo (b e) 164 (defun convert-list-to-texinfo (b e)
165 "Convert the selected region from an ASCII list to a Texinfo list." 165 "Convert the selected region from an ASCII list to a Texinfo list."
166 (interactive "r") 166 (interactive "r")
167 (save-restriction 167 (save-restriction
168 (narrow-to-region b e) 168 (narrow-to-region b e)
169 (goto-char (point-min)) 169 (goto-char (point-min))
170 (let ((dash-type "^ *-+ +") 170 (let ((dash-type "^ *\\(-+\\|o\\) +")
171 ;; allow single-letter numbering or roman numerals 171 ;; allow single-letter numbering or roman numerals
172 (letter-type "^ *[[(]?\\([a-zA-Z]\\|[IVXivx]+\\)[]).] +") 172 (letter-type "^ *[[(]?\\([a-zA-Z]\\|[IVXivx]+\\)[]).] +")
173 (num-type "^ *[[(]?[0-9]+[]).] +") 173 (num-type "^ *[[(]?[0-9]+[]).] +")
174 dash regexp) 174 dash regexp)
175 (save-excursion 175 (save-excursion
237 (insert-char ?\ (- min (current-column))) 237 (insert-char ?\ (- min (current-column)))
238 (beginning-of-line) 238 (beginning-of-line)
239 (forward-char min)) 239 (forward-char min))
240 (kill-rectangle b (point)))))) 240 (kill-rectangle b (point))))))
241 241
242 (defun table-to-texinfo (b e) 242 (defun convert-table-to-texinfo (b e)
243 "Convert the selected region from an ASCII table to a Texinfo table. 243 "Convert the selected region from an ASCII table to a Texinfo table.
244 Assumes entries are separated by a blank line, and the first sexp in 244 Assumes entries are separated by a blank line, and the first sexp in
245 each entry is the table heading." 245 each entry is the table heading."
246 (interactive "r") 246 (interactive "r")
247 (save-restriction 247 (save-restriction
281 If the region is active, do the region; otherwise, go from point to the end 281 If the region is active, do the region; otherwise, go from point to the end
282 of the buffer. This query-replaces for various kinds of conventions used 282 of the buffer. This query-replaces for various kinds of conventions used
283 in text: @code{} surrounded by ` and ' or followed by a (); @strong{} 283 in text: @code{} surrounded by ` and ' or followed by a (); @strong{}
284 surrounded by *'s; @file{} something that looks like a file name." 284 surrounded by *'s; @file{} something that looks like a file name."
285 (interactive) 285 (interactive)
286 (if (and (not no-narrow) (region-active-p)) 286 (save-excursion
287 (save-restriction 287 (if (and (not no-narrow) (region-active-p))
288 (narrow-to-region (region-beginning) (region-end)) 288 (save-restriction
289 (convert-text-to-texinfo t)) 289 (narrow-to-region (region-beginning) (region-end))
290 (let ((p (point)) 290 (goto-char (region-beginning))
291 (case-replace nil)) 291 (zmacs-deactivate-region)
292 (query-replace-regexp "`\\([^']+\\)'\\([^']\\)" "@code{\\1}\\2" nil) 292 (convert-text-to-texinfo t))
293 (goto-char p) 293 (let ((p (point))
294 (query-replace-regexp "\\(\\Sw\\)\\*\\(\\(?:\\s_\\|\\sw\\)+\\)\\*\\([^A-Za-z.}]\\)" "\\1@strong{\\2}\\3" nil) 294 (case-replace nil))
295 (goto-char p) 295 (message "Point is %d" (point))
296 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+()\\)\\([^}]\\)" "@code{\\1}\\3" nil) 296 (query-replace-regexp "`\\([^']+\\)'\\([^']\\)" "@code{\\1}\\2" nil)
297 (goto-char p) 297 (goto-char p)
298 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+\\.[A-Za-z]+\\)\\([^A-Za-z.}]\\)" "@file{\\1}\\3" nil) 298 (query-replace-regexp "\\(\\Sw\\)\\*\\(\\(?:\\s_\\|\\sw\\)+\\)\\*\\([^A-Za-z.}]\\)" "\\1@strong{\\2}\\3" nil)
299 ))) 299 (goto-char p)
300 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+()\\)\\([^}]\\)" "@code{\\1}\\3" nil)
301 (goto-char p)
302 (query-replace-regexp "\\(\\(\\s_\\|\\sw\\)+\\.[A-Za-z]+\\)\\([^A-Za-z.}]\\)" "@file{\\1}\\3" nil)
303 ))))
300 304
301 4. Adding new sections: 305 4. Adding new sections:
302 ----------------------- 306 -----------------------
303 307
304 NOTE: These are in the form of macros. #### FIXME Convert them to 308 NOTE: These are in the form of macros. #### FIXME Convert them to
1236 XEmacs is a powerful, customizable text editor and development 1240 XEmacs is a powerful, customizable text editor and development
1237 environment. It began in 1991 as Lucid Emacs, which was in turn 1241 environment. It began in 1991 as Lucid Emacs, which was in turn
1238 derived from GNU Emacs, a program written by Richard Stallman of the 1242 derived from GNU Emacs, a program written by Richard Stallman of the
1239 Free Software Foundation. GNU Emacs dates back to 1985 and was 1243 Free Software Foundation. GNU Emacs dates back to 1985 and was
1240 modelled after Unipress Emacs, an editor written by James Gosling in 1244 modelled after Unipress Emacs, an editor written by James Gosling in
1241 1981 and based on a series of other "Emacs"-like editors, including 1245 1981 and based on a series of other ``Emacs''-like editors, including
1242 EINE (EINE Is Not EMACS), c. 1976, by Dan Weinreb, which ran on the 1246 EINE (EINE Is Not EMACS), c. 1976, by Dan Weinreb, which ran on the
1243 MIT Lisp Machine and was the first Emacs written in Lisp; ZWEI (ZWEI 1247 MIT Lisp Machine and was the first Emacs written in Lisp; ZWEI (ZWEI
1244 Was EINE Initially), c. 1978, by Dan Weinreb and Mike McMahon; Multics 1248 Was EINE Initially), c. 1978, by Dan Weinreb and Mike McMahon; Multics
1245 Emacs, c. 1978, by Bernie Greenberg, which was written in MacLisp and 1249 Emacs, c. 1978, by Bernie Greenberg, which was written in MacLisp and
1246 also used Lisp as its extension language; and ZMACS, c. 1980, a direct 1250 also used Lisp as its extension language; and ZMACS, c. 1980, a direct
1247 descendant of ZWEI that ran on the Symbolics LM-2, LMI LispM, and 1251 descendant of ZWEI that ran on the Symbolics LM-2, LMI LispM, and
1248 later, TI Explorer (1983-1989). These in turn were inspired by the 1252 later, TI Explorer (1983-1989). These in turn were inspired by the
1249 first Emacs, a package called EMACS, written in 1976 by Richard 1253 first Emacs, a package called EMACS, written in 1976 by Richard
1250 Stallman, Guy Steele, and Dave Moon. This was a merger of TECMAC and 1254 Stallman, Guy Steele, and Dave Moon. This was a merger of TECMAC and
1251 TMACS, a pair of "TECO-macro realtime editors" written by Guy Steele, 1255 TMACS, a pair of ``TECO-macro realtime editors'' written by Guy Steele,
1252 Dave Moon, Richard Greenblatt, Charles Frankston, et al., and added a 1256 Dave Moon, Richard Greenblatt, Charles Frankston, et al., and added a
1253 dynamic loader and Meta-key cmds. It ran under ITS (the Incompatible 1257 dynamic loader and Meta-key cmds. It ran under ITS (the Incompatible
1254 Timesharing System) on a DEC PDP 10 and under TWENEX on a Tops-20 and 1258 Timesharing System) on a DEC PDP 10 and under TWENEX on a Tops-20 and
1255 was written in TECO and PDP 10 assembly. ITS was one of the first 1259 was written in TECO and PDP 10 assembly. ITS was one of the first
1256 time-sharing operating systems and dates back well before Unix. ITS, 1260 time-sharing operating systems and dates back well before Unix. ITS,
1284 M. Stallman (RMS) and James Gosling (the creator of Java); its extension 1288 M. Stallman (RMS) and James Gosling (the creator of Java); its extension
1285 language was known as @dfn{Mocklisp}. This version of Emacs-in-C formed 1289 language was known as @dfn{Mocklisp}. This version of Emacs-in-C formed
1286 the basis for the early versions of GNU Emacs and also for Gosling's 1290 the basis for the early versions of GNU Emacs and also for Gosling's
1287 Unipress Emacs, a commercial product. Because of bad blood between the 1291 Unipress Emacs, a commercial product. Because of bad blood between the
1288 two over the issue of commercialism, RMS pretty much disowned this 1292 two over the issue of commercialism, RMS pretty much disowned this
1289 collaboration, referring to it as "Gosling Emacs". 1293 collaboration, referring to it as ``Gosling Emacs''.
1290 1294
1291 At this point we pick up with a time line of events. (A broader timeline 1295 At this point we pick up with a time line of events. (A broader timeline
1292 is available at @uref{http://www.jwz.org/doc/emacs-timeline.html, 1296 is available at @uref{http://www.jwz.org/doc/emacs-timeline.html,
1293 ``Emacs Timeline''}.) 1297 ``Emacs Timeline''}.)
1294 1298
1575 redisplay code, preliminary I18N support, code merged from GNU Emacs 1579 redisplay code, preliminary I18N support, code merged from GNU Emacs
1576 19.8 beta) 1580 19.8 beta)
1577 @item 1581 @item
1578 Version 19.9 released January 12, 1994. (Scrollbars, Athena.) 1582 Version 19.9 released January 12, 1994. (Scrollbars, Athena.)
1579 @item 1583 @item
1580 Version 19.10 released May 27, 1994. (Uses `configure'; code merged 1584 Version 19.10 released May 27, 1994. (Uses @code{configure}; code merged
1581 from GNU Emacs 19.23 beta and further merging with Epoch 4.0) Known as 1585 from GNU Emacs 19.23 beta and further merging with Epoch 4.0) Known as
1582 "Lucid Emacs" when shipped by Lucid, and as "XEmacs" when shipped by 1586 ``Lucid Emacs'' when shipped by Lucid, and as ``XEmacs'' when shipped by
1583 Sun; but Lucid went out of business a few days later and it's unclear 1587 Sun; but Lucid went out of business a few days later and it's unclear
1584 whether very many copies of 19.10 were released by Lucid. (Last release by 1588 whether very many copies of 19.10 were released by Lucid. (Last release by
1585 Jamie Zawinski.) 1589 Jamie Zawinski.)
1586 @end itemize 1590 @end itemize
1587 1591
1887 rewritten redisplay, TTY support, multi-device support, device and 1891 rewritten redisplay, TTY support, multi-device support, device and
1888 console objects, specifiers, glyphs, toolbars, horizontal scrollbars, 1892 console objects, specifiers, glyphs, toolbars, horizontal scrollbars,
1889 Lucid scrollbar widget, 3-d modeline, stay-up Lucid menus, resizable 1893 Lucid scrollbar widget, 3-d modeline, stay-up Lucid menus, resizable
1890 minibuffer, echo area is a true buffer, MD5 hashing support, expanded 1894 minibuffer, echo area is a true buffer, MD5 hashing support, expanded
1891 menubar, redone menu specification format (including menu filters), 1895 menubar, redone menu specification format (including menu filters),
1892 rewritten extents, renamed "screen" to "frame", misc-user events, 1896 rewritten extents, renamed ``screen'' to ``frame'', misc-user events,
1893 rewritten face code, rewritten mouse code, warnings system, CL 1897 rewritten face code, rewritten mouse code, warnings system, CL
1894 backquote syntax, critical C-g, code merging with GNU Emacs 19.28. 1898 backquote syntax, critical C-g, code merging with GNU Emacs 19.28.
1895 New packages Hyperbole, OOBR, hm--html-menus, viper, lazy-lock, 1899 New packages Hyperbole, OOBR, hm--html-menus, viper, lazy-lock,
1896 ksh-mode, rsz-minibuf.) 1900 ksh-mode, rsz-minibuf.)
1897 @item 1901 @item
1935 version 20.4 released February 28, 1998. 1939 version 20.4 released February 28, 1998.
1936 @item 1940 @item
1937 version 21.0.60 released December 10, 1998. (The version naming scheme was 1941 version 21.0.60 released December 10, 1998. (The version naming scheme was
1938 changed at this point: [a] the second version number is odd for stable 1942 changed at this point: [a] the second version number is odd for stable
1939 versions, even for beta versions; [b] a third version number is added, 1943 versions, even for beta versions; [b] a third version number is added,
1940 replacing the "beta xxx" ending for beta versions and allowing for 1944 replacing the ``beta xxx'' ending for beta versions and allowing for
1941 periodic maintenance releases for stable versions. Therefore, 21.0 was 1945 periodic maintenance releases for stable versions. Therefore, 21.0 was
1942 never "officially" released; similarly for 21.2, etc.) 1946 never ``officially'' released; similarly for 21.2, etc.)
1943 @item 1947 @item
1944 version 21.0.61 released January 4, 1999. 1948 version 21.0.61 released January 4, 1999.
1945 @item 1949 @item
1946 version 21.0.63 released February 3, 1999. 1950 version 21.0.63 released February 3, 1999.
1947 @item 1951 @item
1953 @item 1957 @item
1954 version 21.0.67 released March 25, 1999. 1958 version 21.0.67 released March 25, 1999.
1955 @item 1959 @item
1956 version 21.1.2 released May 14, 1999. (This is the followup to 21.0.67. 1960 version 21.1.2 released May 14, 1999. (This is the followup to 21.0.67.
1957 The second version number was bumped to indicate the beginning of the 1961 The second version number was bumped to indicate the beginning of the
1958 "stable" series.) 1962 ``stable'' series.)
1959 @item 1963 @item
1960 version 21.1.3 released June 26, 1999. 1964 version 21.1.3 released June 26, 1999.
1961 @item 1965 @item
1962 version 21.1.4 released July 8, 1999. 1966 version 21.1.4 released July 8, 1999.
1963 @item 1967 @item
2043 @item 2047 @item
2044 version 21.2.39 released December 31, 2000. 2048 version 21.2.39 released December 31, 2000.
2045 @item 2049 @item
2046 version 21.2.40 released January 8, 2001. 2050 version 21.2.40 released January 8, 2001.
2047 @item 2051 @item
2048 version 21.2.41 "Polyhymnia" released January 17, 2001. 2052 version 21.2.41 ``Polyhymnia'' released January 17, 2001.
2049 @item 2053 @item
2050 version 21.2.42 "Poseidon" released January 20, 2001. 2054 version 21.2.42 ``Poseidon'' released January 20, 2001.
2051 @item 2055 @item
2052 version 21.2.43 "Terspichore" released January 26, 2001. 2056 version 21.2.43 ``Terspichore'' released January 26, 2001.
2053 @item 2057 @item
2054 version 21.2.44 "Thalia" released February 8, 2001. 2058 version 21.2.44 ``Thalia'' released February 8, 2001.
2055 @item 2059 @item
2056 version 21.2.45 "Thelxepeia" released February 23, 2001. 2060 version 21.2.45 ``Thelxepeia'' released February 23, 2001.
2057 @item 2061 @item
2058 version 21.2.46 "Urania" released March 21, 2001. 2062 version 21.2.46 ``Urania'' released March 21, 2001.
2059 @item 2063 @item
2060 version 21.2.47 "Zephir" released April 14, 2001. 2064 version 21.2.47 ``Zephir'' released April 14, 2001.
2061 @item 2065 @item
2062 XEmacs 21.4.0 "Solid Vapor" released April 16, 2001. 2066 XEmacs 21.4.0 ``Solid Vapor'' released April 16, 2001.
2063 @item 2067 @item
2064 XEmacs 21.4.1 "Copyleft" released April 19, 2001. 2068 XEmacs 21.4.1 ``Copyleft'' released April 19, 2001.
2065 @item 2069 @item
2066 XEmacs 21.4.2 "Developer-Friendly Unix APIs" released May 10, 2001. 2070 XEmacs 21.4.2 ``Developer-Friendly Unix APIs'' released May 10, 2001.
2067 @item 2071 @item
2068 XEmacs 21.4.3 "Academic Rigor" released May 17, 2001. 2072 XEmacs 21.4.3 ``Academic Rigor'' released May 17, 2001.
2069 @item 2073 @item
2070 XEmacs 21.4.4 "Artificial Intelligence" released July 28, 2001. 2074 XEmacs 21.4.4 ``Artificial Intelligence'' released July 28, 2001.
2071 @item 2075 @item
2072 XEmacs 21.4.5 "Civil Service" released October 23, 2001. 2076 XEmacs 21.4.5 ``Civil Service'' released October 23, 2001.
2073 @item 2077 @item
2074 XEmacs 21.4.6 "Common Lisp" released December 17, 2001. 2078 XEmacs 21.4.6 ``Common Lisp'' released December 17, 2001.
2075 @item 2079 @item
2076 XEmacs 21.4.7 "Economic Science" released May 4, 2002. 2080 XEmacs 21.4.7 ``Economic Science'' released May 4, 2002.
2077 @item 2081 @item
2078 XEmacs 21.4.8 "Honest Recruiter" released May 9, 2002. 2082 XEmacs 21.4.8 ``Honest Recruiter'' released May 9, 2002.
2079 @item 2083 @item
2080 XEmacs 21.4.9 "Informed Management" released August 23, 2002. 2084 XEmacs 21.4.9 ``Informed Management'' released August 23, 2002.
2081 @item 2085 @item
2082 XEmacs 21.4.10 "Military Intelligence" released November 2, 2002. 2086 XEmacs 21.4.10 ``Military Intelligence'' released November 2, 2002.
2083 @item 2087 @item
2084 XEmacs 21.4.11 "Native Windows TTY Support" released January 3, 2003. 2088 XEmacs 21.4.11 ``Native Windows TTY Support'' released January 3, 2003.
2085 @item 2089 @item
2086 XEmacs 21.4.12 "Portable Code" released January 15, 2003. 2090 XEmacs 21.4.12 ``Portable Code'' released January 15, 2003.
2087 @item 2091 @item
2088 XEmacs 21.4.13 "Rational FORTRAN" released May 25, 2003. 2092 XEmacs 21.4.13 ``Rational FORTRAN'' released May 25, 2003.
2089 @item 2093 @item
2090 XEmacs 21.4.14 "Reasonable Discussion" released September 3, 2003. 2094 XEmacs 21.4.14 ``Reasonable Discussion'' released September 3, 2003.
2091 @item 2095 @item
2092 XEmacs 21.4.15 "Security Through Obscurity" released February 2, 2004. 2096 XEmacs 21.4.15 ``Security Through Obscurity'' released February 2, 2004.
2093 @item 2097 @item
2094 XEmacs 21.4.16 "Successful IPO" released December 5, 2004. 2098 XEmacs 21.4.16 ``Successful IPO'' released December 5, 2004.
2095 @item 2099 @item
2096 version 21.5.0 "alfalfa" released April 18, 2001. 2100 version 21.5.0 ``alfalfa'' released April 18, 2001.
2097 @item 2101 @item
2098 version 21.5.1 "anise" released May 9, 2001. 2102 version 21.5.1 ``anise'' released May 9, 2001.
2099 @item 2103 @item
2100 version 21.5.2 "artichoke" released July 28, 2001. 2104 version 21.5.2 ``artichoke'' released July 28, 2001.
2101 @item 2105 @item
2102 version 21.5.3 "asparagus" released September 7, 2001. 2106 version 21.5.3 ``asparagus'' released September 7, 2001.
2103 @item 2107 @item
2104 version 21.5.4 "bamboo" released January 8, 2002. 2108 version 21.5.4 ``bamboo'' released January 8, 2002.
2105 @item 2109 @item
2106 version 21.5.5 "beets" released March 5, 2002. 2110 version 21.5.5 ``beets'' released March 5, 2002.
2107 @item 2111 @item
2108 version 21.5.6 "bok choi" released April 5, 2002. 2112 version 21.5.6 ``bok choi'' released April 5, 2002.
2109 @item 2113 @item
2110 version 21.5.7 "broccoflower" released July 2, 2002. 2114 version 21.5.7 ``broccoflower'' released July 2, 2002.
2111 @item 2115 @item
2112 version 21.5.8 "broccoli" released July 27, 2002. 2116 version 21.5.8 ``broccoli'' released July 27, 2002.
2113 @item 2117 @item
2114 version 21.5.9 "brussels sprouts" released August 30, 2002. 2118 version 21.5.9 ``brussels sprouts'' released August 30, 2002.
2115 @item 2119 @item
2116 version 21.5.10 "burdock" released January 4, 2003. 2120 version 21.5.10 ``burdock'' released January 4, 2003.
2117 @item 2121 @item
2118 version 21.5.11 "cabbage" released February 16, 2003. 2122 version 21.5.11 ``cabbage'' released February 16, 2003.
2119 @item 2123 @item
2120 version 21.5.12 "carrot" released April 24, 2003. 2124 version 21.5.12 ``carrot'' released April 24, 2003.
2121 @item 2125 @item
2122 version 21.5.13 "cauliflower" released May 10, 2003. 2126 version 21.5.13 ``cauliflower'' released May 10, 2003.
2123 @item 2127 @item
2124 version 21.5.14 "cassava" released June 1, 2003. 2128 version 21.5.14 ``cassava'' released June 1, 2003.
2125 @item 2129 @item
2126 version 21.5.15 "celery" released September 3, 2003. 2130 version 21.5.15 ``celery'' released September 3, 2003.
2127 @item 2131 @item
2128 version 21.5.16 "celeriac" released September 26, 2003. 2132 version 21.5.16 ``celeriac'' released September 26, 2003.
2129 @item 2133 @item
2130 version 21.5.17 "chayote" released March 22, 2004. 2134 version 21.5.17 ``chayote'' released March 22, 2004.
2131 @item 2135 @item
2132 version 21.5.18 "chestnut" released October 22, 2004. 2136 version 21.5.18 ``chestnut'' released October 22, 2004.
2133 @end itemize 2137 @end itemize
2134 2138
2135 @node The XEmacs Split, XEmacs from the Outside, A History of Emacs, Top 2139 @node The XEmacs Split, XEmacs from the Outside, A History of Emacs, Top
2136 @chapter The XEmacs Split 2140 @chapter The XEmacs Split
2137 @cindex XEmacs split 2141 @cindex XEmacs split
2151 to cooperate a bit with RMS, and the two versions of Emacs will merge. In 2155 to cooperate a bit with RMS, and the two versions of Emacs will merge. In
2152 fact there have been six to seven major attempts at merging, each running 2156 fact there have been six to seven major attempts at merging, each running
2153 hundreds of messages long and all of them coming from the XEmacs side. All 2157 hundreds of messages long and all of them coming from the XEmacs side. All
2154 have failed because they have eventually come to the same conclusion, which 2158 have failed because they have eventually come to the same conclusion, which
2155 is that RMS has no real interest in cooperation at all. If you work with 2159 is that RMS has no real interest in cooperation at all. If you work with
2156 him, you have to do it his way -- "my way or the highway". Specifically: 2160 him, you have to do it his way -- ``my way or the highway''. Specifically:
2157 2161
2158 @enumerate 2162 @enumerate
2159 @item 2163 @item
2160 2164
2161 RMS insists on having legal papers signed for every bit of code that goes 2165 RMS insists on having legal papers signed for every bit of code that goes
4046 zero or more Kanji characters followed by zero or more 4050 zero or more Kanji characters followed by zero or more
4047 Hiragana characters. 4051 Hiragana characters.
4048 @end display 4052 @end display
4049 4053
4050 Then, the problem is that now we can't say that a sequence of 4054 Then, the problem is that now we can't say that a sequence of
4051 word-constituents makes up a word. For instance, both Hiragana "A" 4055 word-constituents makes up a word. For instance, both Hiragana ``A''
4052 and Kanji "KAN" are word-constituents but the sequence of these two 4056 and Kanji ``KAN'' are word-constituents but the sequence of these two
4053 letters can't be a single word. 4057 letters can't be a single word.
4054 4058
4055 So, we introduced Sextword for Japanese letters. 4059 So, we introduced Sextword for Japanese letters.
4056 @end quotation 4060 @end quotation
4057 4061
5006 @item 5010 @item
5007 Any header-file declarations of the sort 5011 Any header-file declarations of the sort
5008 5012
5009 struct foobar; 5013 struct foobar;
5010 5014
5011 go into the "types" section of lisp.h. 5015 go into the ``types'' section of @file{lisp.h}.
5012 @end itemize 5016 @end itemize
5013 5017
5014 @node Writing New Modules, Working with Lisp Objects, Introduction to Writing C Code, Rules When Writing New C Code 5018 @node Writing New Modules, Working with Lisp Objects, Introduction to Writing C Code, Rules When Writing New C Code
5015 @section Writing New Modules 5019 @section Writing New Modules
5016 @cindex writing new modules 5020 @cindex writing new modules
5664 correct it or flag it as incorrect, as described in the previous 5668 correct it or flag it as incorrect, as described in the previous
5665 paragraph. Whenever you work on a section of code, @emph{always} make 5669 paragraph. Whenever you work on a section of code, @emph{always} make
5666 sure to update any comments to be correct -- or, at the very least, flag 5670 sure to update any comments to be correct -- or, at the very least, flag
5667 them as incorrect. 5671 them as incorrect.
5668 5672
5669 To indicate a "todo" or other problem, use four pound signs -- 5673 To indicate a ``todo'' or other problem, use four pound signs --
5670 i.e. @samp{####}. 5674 i.e. @samp{####}.
5671 5675
5672 @node Adding Global Lisp Variables, Writing Macros, Writing Good Comments, Rules When Writing New C Code 5676 @node Adding Global Lisp Variables, Writing Macros, Writing Good Comments, Rules When Writing New C Code
5673 @section Adding Global Lisp Variables 5677 @section Adding Global Lisp Variables
5674 @cindex global Lisp variables, adding 5678 @cindex global Lisp variables, adding
5849 @enumerate 5853 @enumerate
5850 @item 5854 @item
5851 Anything that's an lvalue can be evaluated more than once. 5855 Anything that's an lvalue can be evaluated more than once.
5852 @item 5856 @item
5853 Macros where anything else can be evaluated more than once should 5857 Macros where anything else can be evaluated more than once should
5854 have the word "unsafe" in their name (exceptions may be made for 5858 have the word ``unsafe'' in their name (exceptions may be made for
5855 large sets of macros that evaluate arguments of certain types more 5859 large sets of macros that evaluate arguments of certain types more
5856 than once, e.g. struct buffer * arguments, when clearly indicated in 5860 than once, e.g. struct buffer * arguments, when clearly indicated in
5857 the macro documentation). These macros are generally meant to be 5861 the macro documentation). These macros are generally meant to be
5858 called only by other macros that have already stored the calling 5862 called only by other macros that have already stored the calling
5859 values in temporary variables. 5863 values in temporary variables.
5881 Capitalize macros doing stuff obviously impossible with (C) 5885 Capitalize macros doing stuff obviously impossible with (C)
5882 functions, e.g. directly modifying arguments as if they were passed by 5886 functions, e.g. directly modifying arguments as if they were passed by
5883 reference. 5887 reference.
5884 @item 5888 @item
5885 Capitalize macros that evaluate @strong{any} argument more than once regardless 5889 Capitalize macros that evaluate @strong{any} argument more than once regardless
5886 of whether that's "allowed" (e.g. buffer arguments). 5890 of whether that's ``allowed'' (e.g. buffer arguments).
5887 @item 5891 @item
5888 Capitalize macros that directly access a field in a Lisp_Object or 5892 Capitalize macros that directly access a field in a Lisp_Object or
5889 its equivalent underlying structure. In such cases, access through the 5893 its equivalent underlying structure. In such cases, access through the
5890 Lisp_Object precedes the macro with an X, and access through the underlying 5894 Lisp_Object precedes the macro with an X, and access through the underlying
5891 structure doesn't. 5895 structure doesn't.
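The multiple-evaluation rules in the hunks above can be illustrated with a minimal C sketch. The names `UNSAFE_MAX`, `safe_max`, `bump`, and `bump_count` are hypothetical and chosen only for this illustration; they are not taken from the XEmacs sources.

```c
#include <assert.h>

/* Hypothetical macro, named per the guideline above: whichever argument
   turns out to be larger is evaluated a second time. */
#define UNSAFE_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Function version: each argument is evaluated exactly once. */
int
safe_max (int a, int b)
{
  return a > b ? a : b;
}

int bump_count;   /* number of times bump () has run */

/* An argument expression with a side effect. */
int
bump (void)
{
  return ++bump_count;
}

/* Self-check showing the difference between the two forms. */
void
check_double_evaluation (void)
{
  bump_count = 0;
  assert (UNSAFE_MAX (bump (), 0) == 2);  /* bump () ran twice */
  assert (bump_count == 2);

  bump_count = 0;
  assert (safe_max (bump (), 0) == 1);    /* bump () ran once */
  assert (bump_count == 1);
}
```

Calling `check_double_evaluation ()` exercises both forms. Putting "unsafe" in the macro name warns callers to pass only side-effect-free expressions, or values already stored in temporaries.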
5936 a search-and-replace is done to change type names and such. Some people 5940 a search-and-replace is done to change type names and such. Some people
5937 disagree with such changes, and certainly if done without good reason 5941 disagree with such changes, and certainly if done without good reason
5938 will just lead to headaches. But it's important to keep the code clean 5942 will just lead to headaches. But it's important to keep the code clean
5939 and understandable, and consistent naming goes a long way towards this. 5943 and understandable, and consistent naming goes a long way towards this.
5940 5944
5941 An example of the right way to do this was the so-called "great integral 5945 An example of the right way to do this was the so-called ``great integral
5942 type renaming". 5946 type renaming''.
5943 5947
5944 @menu 5948 @menu
5945 * Great Integral Type Renaming:: 5949 * Great Integral Type Renaming::
5946 * Text/Char Type Renaming:: 5950 * Text/Char Type Renaming::
5947 @end menu 5951 @end menu
5964 @item 5968 @item
5965 All integral types that measure quantities of anything are signed. Some 5969 All integral types that measure quantities of anything are signed. Some
5966 people disagree vociferously with this, but their arguments are mostly 5970 people disagree vociferously with this, but their arguments are mostly
5967 theoretical, and are vastly outweighed by the practical headaches of 5971 theoretical, and are vastly outweighed by the practical headaches of
5968 mixing signed and unsigned values, and more importantly by the far 5972 mixing signed and unsigned values, and more importantly by the far
5969 increased likelihood of inadvertent bugs: Because of the broken "viral" 5973 increased likelihood of inadvertent bugs: Because of the broken ``viral''
5970 nature of unsigned quantities in C (operations involving mixed 5974 nature of unsigned quantities in C (operations involving mixed
5971 signed/unsigned are done unsigned, when exactly the opposite is nearly 5975 signed/unsigned are done unsigned, when exactly the opposite is nearly
5972 always wanted), even a single error in declaring a quantity unsigned 5976 always wanted), even a single error in declaring a quantity unsigned
5973 that should be signed, or even the even more subtle error of comparing 5977 that should be signed, or even the even more subtle error of comparing
5974 signed and unsigned values and forgetting the necessary cast, can be 5978 signed and unsigned values and forgetting the necessary cast, can be
5975 catastrophic, as comparisons will yield wrong results. -Wsign-compare 5979 catastrophic, as comparisons will yield wrong results. @samp{-Wsign-compare}
5976 is turned on specifically to catch this, but this tends to result in a 5980 is turned on specifically to catch this, but this tends to result in a
5977 great number of warnings when mixing signed and unsigned, and the casts 5981 great number of warnings when mixing signed and unsigned, and the casts
5978 are annoying. More has been written on this elsewhere. 5982 are annoying. More has been written on this elsewhere.
5979 5983
5980 @item 5984 @item
5989 Type names should be relatively short (no more than 10 characters or 5993 Type names should be relatively short (no more than 10 characters or
5990 so), with the first letter capitalized and no underscores if they can at 5994 so), with the first letter capitalized and no underscores if they can at
5991 all be avoided. 5995 all be avoided.
5992 5996
5993 @item 5997 @item
5994 "count" == a zero-based measurement of some quantity. Includes sizes, 5998 ``count'' == a zero-based measurement of some quantity. Includes sizes,
5995 offsets, and indexes. 5999 offsets, and indexes.
5996 6000
5997 @item 6001 @item
5998 "bpos" == a one-based measurement of a position in a buffer. "Charbpos" 6002 ``bpos'' == a one-based measurement of a position in a buffer. ``Charbpos''
5999 and "Bytebpos" count text in the buffer, rather than bytes in memory; 6003 and ``Bytebpos'' count text in the buffer, rather than bytes in memory;
6000 thus Bytebpos does not directly correspond to the memory representation. 6004 thus Bytebpos does not directly correspond to the memory representation.
6001 Use "Membpos" for this. 6005 Use ``Membpos'' for this.
6002 6006
6003 @item 6007 @item
6004 "Char" refers to internal-format characters, not to the C type "char", 6008 ``Char'' refers to internal-format characters, not to the C type ``char'',
6005 which is really a byte. 6009 which is really a byte.
6006 @end itemize 6010 @end itemize
6007 6011
6008 For the actual name changes, see the script below. 6012 For the actual name changes, see the script below.
6009 6013
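The comparison hazard described in the first item above can be sketched as follows. This is only an illustration; the function names are hypothetical and do not appear in XEmacs. The cast in the first function spells out the conversion that C performs silently for a plain @code{pos < end} when @code{end} is a @code{size_t}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: is POS strictly before END?
   The cast is what C does implicitly in a mixed comparison: POS is
   converted to unsigned, so a negative POS becomes a huge value. */
int is_before_mixed(int pos, size_t end)
{
    return (size_t) pos < end;
}

/* The same question with both quantities signed, as the convention
   above asks for.  END is a count, so it must not be negative. */
int is_before_signed(long pos, long end)
{
    assert(end >= 0);
    return pos < end;
}
```

With a position of -1 and an end of 10, the mixed version reports that -1 is not before 10, while the all-signed version gives the intended answer.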
6094 #endif 6098 #endif
6095 6099
6096 /* There have been some arguments over what the type should be that 6100 /* There have been some arguments over what the type should be that
6097 specifies a count of bytes in a data block to be written out or read in, 6101 specifies a count of bytes in a data block to be written out or read in,
6098 using @code{Lstream_read()}, @code{Lstream_write()}, and related functions. 6102 using @code{Lstream_read()}, @code{Lstream_write()}, and related functions.
6099 Originally it was long, which worked fine; Martin "corrected" these to 6103 Originally it was long, which worked fine; Martin ``corrected'' these to
6100 size_t and ssize_t on the grounds that this is theoretically cleaner and 6104 size_t and ssize_t on the grounds that this is theoretically cleaner and
6101 is in keeping with the C standards. Unfortunately, this practice is 6105 is in keeping with the C standards. Unfortunately, this practice is
6102 horribly error-prone due to design flaws in the way that mixed 6106 horribly error-prone due to design flaws in the way that mixed
6103 signed/unsigned arithmetic happens. In fact, by doing this change, 6107 signed/unsigned arithmetic happens. In fact, by doing this change,
6104 Martin introduced a subtle but fatal error that caused the operation of 6108 Martin introduced a subtle but fatal error that caused the operation of
6469 fixed---use the @code{Known-Bug-Expect-Failure} wrapper macro to mark 6473 fixed---use the @code{Known-Bug-Expect-Failure} wrapper macro to mark
6470 them. 6474 them.
6471 6475
6472 @deffn Macro Known-Bug-Expect-Failure body 6476 @deffn Macro Known-Bug-Expect-Failure body
6473 Arrange for failing tests in @var{body} to generate messages prefixed 6477 Arrange for failing tests in @var{body} to generate messages prefixed
6474 with "KNOWN BUG:" instead of "FAIL:". @var{body} is a @code{progn}-like 6478 with ``KNOWN BUG:'' instead of ``FAIL:''. @var{body} is a @code{progn}-like
6475 body, and may contain several tests. 6479 body, and may contain several tests.
6476 @end deffn 6480 @end deffn
6477 6481
6478 A lot of the tests we run push limits; suppress Ebola warning messages 6482 A lot of the tests we run push limits; suppress Ebola warning messages
6479 with the @code{Ignore-Ebola} wrapper macro. 6483 with the @code{Ignore-Ebola} wrapper macro.
6650 with added or deleted files.} If you are lucky, the operation will 6654 with added or deleted files.} If you are lucky, the operation will
6651 simply fail. If you are less lucky, it will proceed, but make the 6655 simply fail. If you are less lucky, it will proceed, but make the
6652 adds and deletes on the main line, which you do not want at all. 6656 adds and deletes on the main line, which you do not want at all.
6653 Therefore, you must undo all adds and deletes. To find out what is 6657 Therefore, you must undo all adds and deletes. To find out what is
6654 added and deleted, use something like @code{cvs -n update >&! 6658 added and deleted, use something like @code{cvs -n update >&!
6655 cvs.out}, which does a "dry run". (You did make a backup copy first, 6659 cvs.out}, which does a ``dry run''. (You did make a backup copy first,
6656 right? What if you forgot the @samp{-n}, for example, and weren't 6660 right? What if you forgot the @samp{-n}, for example, and weren't
6657 prepared for the sudden onslaught of merging action?) Take a look at 6661 prepared for the sudden onslaught of merging action?) Take a look at
6658 the output file @file{cvs.out} and check very carefully for newly 6662 the output file @file{cvs.out} and check very carefully for newly
6659 added files (marked with an @samp{A}) and newly removed files (marked 6663 added files (marked with an @samp{A}) and newly removed files (marked
6660 with an @samp{R}). Double check that your newly added files are in 6664 with an @samp{R}). Double check that your newly added files are in
6682 crw tag -b ben-mule-21-5 6686 crw tag -b ben-mule-21-5
6683 @end example 6687 @end example
6684 6688
6685 Note that this doesn't actually do anything to your local workspace! 6689 Note that this doesn't actually do anything to your local workspace!
6686 It basically just creates another tag in the repository, identical to 6690 It basically just creates another tag in the repository, identical to
6687 the branch point tag but internally marked as a "branch tag" rather 6691 the branch point tag but internally marked as a ``branch tag'' rather
6688 than a regular tag. 6692 than a regular tag.
6689 6693
6690 @item 6694 @item
6691 Now, move your workspace onto the branch: 6695 Now, move your workspace onto the branch:
6692 6696
7016 and when you add a new element, the array automatically resizes itself 7020 and when you add a new element, the array automatically resizes itself
7017 if it isn't big enough. Dynarrs are extensively used in the redisplay 7021 if it isn't big enough. Dynarrs are extensively used in the redisplay
7018 mechanism. 7022 mechanism.
7019 7023
7020 7024
7021 A "dynamic array" is a contiguous array of fixed-size elements where there 7025 A ``dynamic array'' is a contiguous array of fixed-size elements where there
7022 is no upper limit (except available memory) on the number of elements in the 7026 is no upper limit (except available memory) on the number of elements in the
7023 array. Because the elements are maintained contiguously, space is used 7027 array. Because the elements are maintained contiguously, space is used
7024 efficiently (no per-element pointers necessary) and random access to a 7028 efficiently (no per-element pointers necessary) and random access to a
7025 particular element is in constant time. At any one point, the block of memory 7029 particular element is in constant time. At any one point, the block of memory
7026 that holds the array has an upper limit; if this limit is exceeded, the 7030 that holds the array has an upper limit; if this limit is exceeded, the
7027 memory is realloc()ed into a new array that is twice as big. Assuming that 7031 memory is @code{realloc()}ed into a new array that is twice as big. Assuming that
7028 the time to grow the array is on the order of the new size of the array 7032 the time to grow the array is on the order of the new size of the array
7029 block, this scheme has a provably constant amortized time (i.e. average 7033 block, this scheme has a provably constant amortized time (i.e. average
7030 time over all additions). 7034 time over all additions).
7031 7035
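The doubling scheme described above can be sketched as follows. This is a minimal illustration, not the Dynarr API: the type @code{IntArr}, the function @code{intarr_add}, and the initial capacity of 4 are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of ints; NOT the XEmacs Dynarr API. */
typedef struct {
    int *data;   /* contiguous block holding the elements */
    long len;    /* number of elements in use */
    long cap;    /* number of elements the block can hold */
} IntArr;

/* Append VALUE, doubling the block when it is full.  Returns 0 on
   success, or -1 if realloc() fails (the array is left unchanged). */
int intarr_add(IntArr *a, int value)
{
    assert(a->len <= a->cap);
    if (a->len == a->cap) {
        long new_cap = a->cap ? a->cap * 2 : 4;   /* 4 is an arbitrary start */
        int *p = realloc(a->data, (size_t) new_cap * sizeof *p);
        if (!p)
            return -1;
        a->data = p;   /* earlier pointers into the array may now be stale */
        a->cap = new_cap;
    }
    a->data[a->len++] = value;
    return 0;
}
```

Because each reallocation doubles the capacity, adding n elements copies fewer than 2n elements in total, which is where the constant amortized cost comes from.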
7032 When you add elements or retrieve elements, pointers are used. Note that 7036 When you add elements or retrieve elements, pointers are used. Note that
7130 onto a linked list, so they can be efficiently reused. This data type 7134 onto a linked list, so they can be efficiently reused. This data type
7131 is not much used in XEmacs currently, because it's a fairly new 7135 is not much used in XEmacs currently, because it's a fairly new
7132 addition. 7136 addition.
7133 7137
7134 7138
7135 A "block-type object" is used to efficiently allocate and free blocks 7139 A ``block-type object'' is used to efficiently allocate and free blocks
7136 of a particular size. Freed blocks are remembered in a free list and 7140 of a particular size. Freed blocks are remembered in a free list and
7137 are reused as necessary to allocate new blocks, so as to avoid as 7141 are reused as necessary to allocate new blocks, so as to avoid as
7138 much as possible making calls to malloc() and free(). 7142 much as possible making calls to @code{malloc()} and @code{free()}.
7139 7143
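The free-list idea can be sketched as follows. This is an illustration only, not the Blocktype implementation: @code{BlockPool}, @code{pool_init}, @code{pool_alloc}, and @code{pool_free} are hypothetical names, and the sketch allocates one block at a time rather than in batches.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fixed-size block pool; NOT the XEmacs Blocktype macros. */
typedef struct FreeBlock {
    struct FreeBlock *next;
} FreeBlock;

typedef struct {
    size_t block_size;      /* never smaller than sizeof (FreeBlock) */
    FreeBlock *free_list;   /* freed blocks waiting to be reused */
} BlockPool;

void pool_init(BlockPool *p, size_t block_size)
{
    /* A freed block must be able to hold the list link. */
    p->block_size = block_size < sizeof(FreeBlock) ? sizeof(FreeBlock)
                                                   : block_size;
    p->free_list = NULL;
}

/* Reuse a freed block if one exists; otherwise fall back to malloc(). */
void *pool_alloc(BlockPool *p)
{
    if (p->free_list) {
        FreeBlock *b = p->free_list;
        p->free_list = b->next;
        return b;
    }
    return malloc(p->block_size);
}

/* Remember BLOCK on the free list instead of calling free(). */
void pool_free(BlockPool *p, void *block)
{
    FreeBlock *b = block;
    assert(block != NULL);
    b->next = p->free_list;
    p->free_list = b;
}
```

A block released with @code{pool_free} is handed back by the next @code{pool_alloc}, so steady-state allocation and freeing never reach @code{malloc()} or @code{free()}.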
7140 This is a container object. Declare a block-type object of a specific type 7144 This is a container object. Declare a block-type object of a specific type
7141 as follows: 7145 as follows:
7142 7146
7143 struct mytype_blocktype @{ 7147 struct mytype_blocktype @{
8275 @code{this_one_is_unmarkable} in @code{alloc.c}). 8279 @code{this_one_is_unmarkable} in @code{alloc.c}).
8276 8280
8277 Now, the actual marking is feasible. We do so by once using the macro 8281 Now, the actual marking is feasible. We do so by once using the macro
8278 @code{MARK_RECORD_HEADER} to mark the object itself (actually the 8282 @code{MARK_RECORD_HEADER} to mark the object itself (actually the
8279 special flag in the lrecord header), and calling its special marker 8283 special flag in the lrecord header), and calling its special marker
8280 "method" @code{marker} if available. The marker method marks every 8284 ``method'' @code{marker} if available. The marker method marks every
8281 other object that is in reach from our current object. Note that these 8285 other object that is in reach from our current object. Note that these
8282 marker methods should not call @code{mark_object} recursively, but 8286 marker methods should not call @code{mark_object} recursively, but
8283 instead should return the next object from where further marking has to 8287 instead should return the next object from where further marking has to
8284 be performed. 8288 be performed.
8285 8289
8330 @code{sweep_conses}, @code{sweep_bit_vectors_1}, 8334 @code{sweep_conses}, @code{sweep_bit_vectors_1},
8331 @code{sweep_compiled_functions}, @code{sweep_floats}, 8335 @code{sweep_compiled_functions}, @code{sweep_floats},
8332 @code{sweep_symbols}, @code{sweep_extents}, @code{sweep_markers} and 8336 @code{sweep_symbols}, @code{sweep_extents}, @code{sweep_markers} and
8333 @code{sweep_extents}. They are the fixed-size types cons, floats, 8337 @code{sweep_extents}. They are the fixed-size types cons, floats,
8334 compiled-functions, symbol, marker, extent, and event stored in 8338 compiled-functions, symbol, marker, extent, and event stored in
8335 so-called "frob blocks", and therefore we can basically do the same on 8339 so-called ``frob blocks'', and therefore we can basically do the same on
8336 every type of object, using the same macros, defined specifically to 8340 every type of object, using the same macros, defined specifically to
8337 handle everything with respect to fixed-size blocks. The only fixed-size 8341 handle everything with respect to fixed-size blocks. The only fixed-size
8338 type that is not handled here is the fixed-size portion of strings, 8342 type that is not handled here is the fixed-size portion of strings,
8339 because we took special care of them earlier. 8343 because we took special care of them earlier.
8340 8344
10004 complicated depending on how much information we cache. In addition to 10008 complicated depending on how much information we cache. In addition to
10005 the known region, we always cache the correct conversions for point, 10009 the known region, we always cache the correct conversions for point,
10006 BEGV, and ZV, and in addition to this we cache 16 positions where the 10010 BEGV, and ZV, and in addition to this we cache 16 positions where the
10007 conversion is known. We only look in the cache or update it when we 10011 conversion is known. We only look in the cache or update it when we
10008 need to move the known region more than a certain amount (currently 50 10012 need to move the known region more than a certain amount (currently 50
10009 chars), and then we throw away a "random" value and replace it with the 10013 chars), and then we throw away a ``random'' value and replace it with the
10010 newly calculated value. 10014 newly calculated value.
10011 10015
10012 Finally, we maintain an extra flag that tracks whether the buffer is 10016 Finally, we maintain an extra flag that tracks whether the buffer is
10013 entirely ASCII, to speed up the conversions even more. This flag is 10017 entirely ASCII, to speed up the conversions even more. This flag is
10014 actually of dubious value because in an entirely-ASCII buffer the known 10018 actually of dubious value because in an entirely-ASCII buffer the known
10040 track of a shifter value (0, 1, or 2) indicating how much to shift. 10044 track of a shifter value (0, 1, or 2) indicating how much to shift.
10041 Multiplying by 3 can be implemented by doubling and then adding the 10045 Multiplying by 3 can be implemented by doubling and then adding the
10042 original value. Dividing by 3, alas, cannot be implemented in any 10046 original value. Dividing by 3, alas, cannot be implemented in any
10043 simple shift/subtract method, as far as I know; so we just do a table 10047 simple shift/subtract method, as far as I know; so we just do a table
10044 lookup. For simplicity, we use a table of size 128K, which indexes the 10048 lookup. For simplicity, we use a table of size 128K, which indexes the
10045 "divide-by-3" values for the first 64K non-negative numbers. (Note that 10049 ``divide-by-3'' values for the first 64K non-negative numbers. (Note that
10046 we can increase the size up to 384K, i.e. indexing the first 192K 10050 we can increase the size up to 384K, i.e. indexing the first 192K
10047 non-negative numbers, while still using shorts in the array.) This also 10051 non-negative numbers, while still using shorts in the array.) This also
10048 means that the size of the known region can be at most 64K for 10052 means that the size of the known region can be at most 64K for
10049 width-three characters. 10053 width-three characters.
10050 @end quotation 10054 @end quotation
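The shift, double-and-add, and table-lookup arithmetic described in the quotation can be sketched as follows. This is a reconstruction under assumptions, not the original code: the function and table names are hypothetical, and the sketch takes a character width (1 to 4) directly rather than a stored shifter value.

```c
#include <assert.h>

/* 64K entries of 2 bytes each: the 128K table mentioned above. */
#define DIV3_TABLE_SIZE 65536L

static unsigned short div3_table[DIV3_TABLE_SIZE];
static int div3_table_ready;

static void init_div3_table(void)
{
    long i;
    for (i = 0; i < DIV3_TABLE_SIZE; i++)
        div3_table[i] = (unsigned short) (i / 3);
    div3_table_ready = 1;
}

/* Characters to bytes for text whose characters all have WIDTH bytes. */
long chars_to_bytes(long chars, int width)
{
    assert(chars >= 0);
    switch (width) {
    case 1: return chars;                  /* shift by 0 */
    case 2: return chars << 1;             /* shift by 1 */
    case 3: return (chars << 1) + chars;   /* double, then add original */
    default:
        assert(width == 4);
        return chars << 2;                 /* shift by 2 */
    }
}

/* Bytes to characters; width 3 needs the lookup table. */
long bytes_to_chars(long bytes, int width)
{
    assert(bytes >= 0);
    switch (width) {
    case 1: return bytes;
    case 2: return bytes >> 1;
    case 3:
        assert(bytes < DIV3_TABLE_SIZE);   /* the limit noted above */
        if (!div3_table_ready)
            init_div3_table();
        return div3_table[bytes];
    default:
        assert(width == 4);
        return bytes >> 2;
    }
}
```

The bound checked in the width-three case is the reason the known region is limited in size for such characters: offsets beyond the table cannot be divided this way.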
10070 @item 10074 @item
10071 the position of the gap 10075 the position of the gap
10072 @item 10076 @item
10073 the last value we computed 10077 the last value we computed
10074 @item 10078 @item
10075 a set of positions that are "far away" from previously computed positions 10079 a set of positions that are ``far away'' from previously computed positions
10076 (5000 chars currently; #### perhaps should be smaller) 10080 (5000 chars currently; #### perhaps should be smaller)
10077 @end itemize 10081 @end itemize
10078 10082
10079 For each position, we @code{CONSIDER()} it. This means: 10083 For each position, we @code{CONSIDER()} it. This means:
10080 10084
10096 the simple loop in FSF with the use of @code{bytecount_to_charcount()}, 10100 the simple loop in FSF with the use of @code{bytecount_to_charcount()},
10097 @code{charcount_to_bytecount()}, @code{bytecount_to_charcount_down()}, or 10101 @code{charcount_to_bytecount()}, @code{bytecount_to_charcount_down()}, or
10098 @code{charcount_to_bytecount_down()}. (The latter two I added for this purpose.) 10102 @code{charcount_to_bytecount_down()}. (The latter two I added for this purpose.)
10099 These scan 4 or 8 bytes at a time through purely single-byte characters. 10103 These scan 4 or 8 bytes at a time through purely single-byte characters.
10100 10104
10101 If the amount we had to scan was more than our "far away" distance (5000 10105 If the amount we had to scan was more than our ``far away'' distance (5000
10102 characters, see above), then cache the new position. 10106 characters, see above), then cache the new position.
10103 10107
10104 #### Things to do: 10108 #### Things to do:
10105 10109
10106 @itemize @bullet 10110 @itemize @bullet
10107 @item 10111 @item
10108 Look at the most recent GNU Emacs to see whether anything has changed. 10112 Look at the most recent GNU Emacs to see whether anything has changed.
10109 @item 10113 @item
10110 Think about whether it makes sense to try to implement some sort of 10114 Think about whether it makes sense to try to implement some sort of
10111 known region or list of "known regions", like we had before. This would 10115 known region or list of ``known regions'', like we had before. This would
10112 be a region of entirely single-byte characters that we can check very 10116 be a region of entirely single-byte characters that we can check very
10113 quickly. (Previously I used a range of same-width characters of any 10117 quickly. (Previously I used a range of same-width characters of any
10114 size; but this adds extra complexity and slows down the scanning, and is 10118 size; but this adds extra complexity and slows down the scanning, and is
10115 probably not worth it.) As part of the scanning process in 10119 probably not worth it.) As part of the scanning process in
10116 @code{bytecount_to_charcount()} et al, we skip over chunks of entirely 10120 @code{bytecount_to_charcount()} et al, we skip over chunks of entirely
10324 In terms of reading the actual code, there are five optimizations 10328 In terms of reading the actual code, there are five optimizations
10325 (obfuscations, if you like) that have been done. 10329 (obfuscations, if you like) that have been done.
10326 10330
10327 @enumerate 10331 @enumerate
10328 @item 10332 @item
10329 An explicit "failure stack" has been substituted for recursion. 10333 An explicit ``failure stack'' has been substituted for recursion.
10330 10334
10331 @item 10335 @item
10332 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions 10336 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions
10333 are actually inlined into the @code{match} function for efficiency. 10337 are actually inlined into the @code{match} function for efficiency.
10334 Then the pointer movement is interspersed with the matching operations. 10338 Then the pointer movement is interspersed with the matching operations.
10337 If the operator uses buffer context, the buffer pointer movement is 10341 If the operator uses buffer context, the buffer pointer movement is
10338 sometimes implicit in the operations retrieving the context. 10342 sometimes implicit in the operations retrieving the context.
10339 10343
10340 @item 10344 @item
10341 Some cases are combined into short preparation for individual cases, and 10345 Some cases are combined into short preparation for individual cases, and
10342 a "fall-through" into combined code for several cases. 10346 a ``fall-through'' into combined code for several cases.
10343 10347
10344 @item 10348 @item
10345 The @code{pattern} type is not an explicit @samp{struct}. Instead, the 10349 The @code{pattern} type is not an explicit @samp{struct}. Instead, the
10346 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the 10350 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the
10347 compiled bytecode. This leads to bizarre code in the interpreter like 10351 compiled bytecode. This leads to bizarre code in the interpreter like
10356 @example 10360 @example
10357 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ... 10361 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ...
10358 @end example 10362 @end example
10359 @end enumerate 10363 @end enumerate
10360 10364
10361 But if you keep your eye on the "switch in a loop" structure, you 10365 But if you keep your eye on the ``switch in a loop'' structure, you
10362 should be able to understand the parts you need. 10366 should be able to understand the parts you need.
10363 10367
10364 @node Multilingual Support, Consoles; Devices; Frames; Windows, Text, Top 10368 @node Multilingual Support, Consoles; Devices; Frames; Windows, Text, Top
10365 @chapter Multilingual Support 10369 @chapter Multilingual Support
10366 @cindex Mule character sets and encodings 10370 @cindex Mule character sets and encodings
10818 a simple charset like ASCII, there is only one encoding normally used -- 10822 a simple charset like ASCII, there is only one encoding normally used --
10819 each character is represented by a single byte, with the same value as 10823 each character is represented by a single byte, with the same value as
10820 its code point. For more complicated charsets, however, things are not 10824 its code point. For more complicated charsets, however, things are not
10821 so obvious. Unicode version 2, for example, is a large charset with 10825 so obvious. Unicode version 2, for example, is a large charset with
10822 thousands of characters, each indexed by a 16-bit number, often 10826 thousands of characters, each indexed by a 16-bit number, often
10823 represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One 10827 represented in hex, e.g. 0x05D0 for the Hebrew letter ``aleph''. One
10824 obvious encoding uses two bytes per character (actually two encodings, 10828 obvious encoding uses two bytes per character (actually two encodings,
10825 depending on which of the two possible byte orderings is chosen). This 10829 depending on which of the two possible byte orderings is chosen). This
10826 encoding is convenient for internal processing of Unicode text; however, 10830 encoding is convenient for internal processing of Unicode text; however,
10827 it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is 10831 it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is
10828 usually used for external text, for example files or e-mail. UTF-8 10832 usually used for external text, for example files or e-mail. UTF-8
10839 10843
10840 In an ASCII or single-European-character-set world, life is very simple. 10844 In an ASCII or single-European-character-set world, life is very simple.
10841 There are 256 characters, and each character is represented using the 10845 There are 256 characters, and each character is represented using the
10842 numbers 0 through 255, which fit into a single byte. With a few 10846 numbers 0 through 255, which fit into a single byte. With a few
10843 exceptions (such as case-changing operations or syntax classes like 10847 exceptions (such as case-changing operations or syntax classes like
10844 'whitespace'), "text" is simply an array of indices into a font. You 10848 @code{whitespace}), ``text'' is simply an array of indices into a font. You
10845 can get different languages simply by choosing fonts with different 10849 can get different languages simply by choosing fonts with different
10846 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and 10850 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and
10847 everything will "just work" as long as anyone else receiving your text 10851 everything will ``just work'' as long as anyone else receiving your text
10848 uses a compatible font. 10852 uses a compatible font.
10849 10853
10850 In the multi-lingual world, however, it is much more complicated. There 10854 In the multi-lingual world, however, it is much more complicated. There
10851 are a great number of different characters which are organized in a 10855 are a great number of different characters which are organized in a
10852 complex fashion into various character sets. The representation to use 10856 complex fashion into various character sets. The representation to use
10892 text as possible. No operations should ever be performed on text encoded 10896 text as possible. No operations should ever be performed on text encoded
10893 in an external representation other than simple copying, because no 10897 in an external representation other than simple copying, because no
10894 assumptions can reliably be made about the format of this text. You 10898 assumptions can reliably be made about the format of this text. You
10895 cannot assume, for example, that the end of text is terminated by a null 10899 cannot assume, for example, that the end of text is terminated by a null
10896 byte. (For example, if the text is Unicode, it will have many null bytes 10900 byte. (For example, if the text is Unicode, it will have many null bytes
10897 in it.) You cannot find the next "slash" character by searching through 10901 in it.) You cannot find the next ``slash'' character by searching through
10898 the bytes until you find a byte that looks like a "slash" character, 10902 the bytes until you find a byte that looks like a ``slash'' character,
10899 because it might actually be the second byte of a Kanji character. 10903 because it might actually be the second byte of a Kanji character.
10900 Furthermore, all text in the internal representation must be converted, 10904 Furthermore, all text in the internal representation must be converted,
10901 even if it is known to be completely ASCII, because the external 10905 even if it is known to be completely ASCII, because the external
10902 representation may not be ASCII compatible (for example, if it is 10906 representation may not be ASCII compatible (for example, if it is
10903 Unicode). 10907 Unicode).
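Both hazards can be shown with four bytes of externally encoded text. The sketch below assumes the external format is UTF-16 big-endian; the data and the function name @code{naive_find_slash} are made up for the example.

```c
#include <assert.h>
#include <string.h>

/* U+4E2F (a CJK ideograph) followed by U+0041 'A', in UTF-16
   big-endian.  Neither character is a slash or a terminator. */
static const unsigned char utf16be_text[] = { 0x4E, 0x2F, 0x00, 0x41 };

/* Naive byte search: offset of the first byte equal to '/', or -1. */
long naive_find_slash(const unsigned char *bytes, long len)
{
    const unsigned char *hit;
    assert(len >= 0);
    hit = memchr(bytes, '/', (size_t) len);
    return hit ? (long) (hit - bytes) : -1;
}
```

Here the naive search reports a slash at offset 1, which is really the second byte of the ideograph, and @code{strlen()} on the same bytes stops after two bytes because the high byte of @samp{A} is zero.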
10923 the structures of a particular external encoding and the methods required 10927 the structures of a particular external encoding and the methods required
10924 to convert to and from this encoding. A facility exists to create coding 10928 to convert to and from this encoding. A facility exists to create coding
10925 system aliases, which in essence gives a single coding system two 10929 system aliases, which in essence gives a single coding system two
10926 different names. It is effectively used in XEmacs to provide a layer of 10930 different names. It is effectively used in XEmacs to provide a layer of
10927 abstraction on top of the actual coding systems. For example, the coding 10931 abstraction on top of the actual coding systems. For example, the coding
10928 system alias "file-name" points to whichever coding system is currently 10932 system alias ``file-name'' points to whichever coding system is currently
10929 used for encoding and decoding file names as passed to or retrieved from 10933 used for encoding and decoding file names as passed to or retrieved from
10930 system calls. In general, the actual encoding will differ from system to 10934 system calls. In general, the actual encoding will differ from system to
10931 system, and also on the particular locale that the user is in. The use 10935 system, and also on the particular locale that the user is in. The use
10932 of the file-name alias effectively hides that implementation detail on 10936 of the file-name alias effectively hides that implementation detail on
10933 top of that abstract interface layer which provides a unified set of 10937 top of that abstract interface layer which provides a unified set of
11434 C = plain char, when the base type is unsigned 11438 C = plain char, when the base type is unsigned
11435 U = unsigned 11439 U = unsigned
11436 S = signed 11440 S = signed
11437 @end example 11441 @end example
11438 11442
11439 (Formerly I had a comment saying that type (e) "should be replaced with 11443 (Formerly I had a comment saying that type (e) ``should be replaced with
11440 void *". However, there are in fact many places where an unsigned char 11444 void *''. However, there are in fact many places where an unsigned char
11441 * might be used -- e.g. for ease in pointer computation, since void * 11445 * might be used -- e.g. for ease in pointer computation, since void *
11442 doesn't allow this, and for compatibility with external APIs.) 11446 doesn't allow this, and for compatibility with external APIs.)
11443 11447
11444 Note that these typedefs are purely for documentation purposes; from 11448 Note that these typedefs are purely for documentation purposes; from
11445 the C code's perspective, they are exactly equivalent to @code{char *}, 11449 the C code's perspective, they are exactly equivalent to @code{char *},
11456 @node Different Ways of Seeing Internal Text, Buffer Positions, Byte Types, Byte/Character Types; Buffer Positions; Other Typedefs 11460 @node Different Ways of Seeing Internal Text, Buffer Positions, Byte Types, Byte/Character Types; Buffer Positions; Other Typedefs
11457 @subsection Different Ways of Seeing Internal Text 11461 @subsection Different Ways of Seeing Internal Text
11458 @cindex different ways of seeing internal text 11462 @cindex different ways of seeing internal text
11459 11463
11460 There are various ways of representing internal text. The two primary 11464 There are various ways of representing internal text. The two primary
11461 ways are as an "array" of individual characters and as a 11465 ways are as an ``array'' of individual characters and as a
11462 "stream" of bytes. In the ASCII world, where there are only 256 11466 ``stream'' of bytes. In the ASCII world, where there are only 256
11463 characters at most, things are easy because each character fits into a 11467 characters at most, things are easy because each character fits into a
11464 byte. In general, however, this is not true -- see the above discussion 11468 byte. In general, however, this is not true -- see the above discussion
11465 of characters vs. encodings. 11469 of characters vs. encodings.
11466 11470
11467 In some cases, it's also important to distinguish between a stream 11471 In some cases, it's also important to distinguish between a stream
11468 representation as a series of bytes and as a series of textual units. 11472 representation as a series of bytes and as a series of textual units.
11469 This is particularly important wrt Unicode. The UTF-16 representation 11473 This is particularly important wrt Unicode. The UTF-16 representation
11470 (sometimes referred to, rather sloppily, as simply the "Unicode" format) 11474 (sometimes referred to, rather sloppily, as simply the ``Unicode'' format)
11471 represents text as a series of 16-bit units. Mostly, each unit 11475 represents text as a series of 16-bit units. Mostly, each unit
11472 corresponds to a single character, but not necessarily, as characters 11476 corresponds to a single character, but not necessarily, as characters
11473 outside of the range 0-65535 (the BMP or "Basic Multilingual Plane" of 11477 outside of the range 0-65535 (the BMP or ``Basic Multilingual Plane'' of
11474 Unicode) require two 16-bit units, through the mechanism of 11478 Unicode) require two 16-bit units, through the mechanism of
11475 "surrogates". When a series of 16-bit units is serialized into a byte 11479 ``surrogates''. When a series of 16-bit units is serialized into a byte
11476 stream, there are at least two possible representations, little-endian 11480 stream, there are at least two possible representations, little-endian
11477 and big-endian, and which one is used may depend on the native format of 11481 and big-endian, and which one is used may depend on the native format of
11478 16-bit integers in the CPU of the machine that XEmacs is running 11482 16-bit integers in the CPU of the machine that XEmacs is running
11479 on. (Similarly, UTF-32 is logically a representation with 32-bit textual 11483 on. (Similarly, UTF-32 is logically a representation with 32-bit textual
11480 units.) 11484 units.)
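To make the distinction between textual units and bytes concrete, here is a minimal sketch in C. The function names `utf16_encode` and `utf16_serialize` are invented for this illustration and are not XEmacs APIs. The first function turns one character into one or two 16-bit units (using surrogates outside the BMP); the second serializes those units into a byte stream in either byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-16 textual units.  Returns the number of
   16-bit units stored in UNITS (1 or 2), or 0 if CP is a surrogate
   value or lies beyond U+10FFFF.  (Hypothetical helper.) */
size_t
utf16_encode (uint32_t cp, uint16_t units[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)
    {
      /* Inside the BMP: one unit per character. */
      units[0] = (uint16_t) cp;
      return 1;
    }
  /* Outside the BMP: split the 20-bit offset into a surrogate pair. */
  cp -= 0x10000;
  units[0] = (uint16_t) (0xD800 | (cp >> 10));
  units[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));
  return 2;
}

/* Serialize N 16-bit units into 2*N bytes at OUT, most significant byte
   first if BIG_ENDIAN_P is nonzero, least significant first otherwise.
   (Hypothetical helper.) */
void
utf16_serialize (const uint16_t *units, size_t n, int big_endian_p,
                 unsigned char *out)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      unsigned char hi = (unsigned char) (units[i] >> 8);
      unsigned char lo = (unsigned char) (units[i] & 0xFF);
      out[2 * i] = big_endian_p ? hi : lo;
      out[2 * i + 1] = big_endian_p ? lo : hi;
    }
}
```

The same two units thus yield two different byte streams, which is why the number of characters, the number of textual units and the number of bytes have to be kept apart.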
11487 @item 11491 @item
11488 UTF-16 has 2-byte (16-bit) units. 11492 UTF-16 has 2-byte (16-bit) units.
11489 @item 11493 @item
11490 UTF-32 has 4-byte (32-bit) units. 11494 UTF-32 has 4-byte (32-bit) units.
11491 @item 11495 @item
11492 XEmacs-internal encoding (the old "Mule" encoding) has 1-byte (8-bit) 11496 XEmacs-internal encoding (the old ``Mule'' encoding) has 1-byte (8-bit)
11493 units. 11497 units.
11494 @item 11498 @item
11495 UTF-7 technically has 7-bit units that are within the "mail-safe" range 11499 UTF-7 technically has 7-bit units that are within the ``mail-safe'' range
11496 (ASCII 32 - 126 plus a few control characters), but normally is encoded 11500 (ASCII 32 - 126 plus a few control characters), but normally is encoded
11497 in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a 11501 in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a
11498 normal mode where printable ASCII characters represent themselves and a 11502 normal mode where printable ASCII characters represent themselves and a
11499 shifted mode, introduced with a plus sign, where a base-64 encoding is 11503 shifted mode, introduced with a plus sign, where a base-64 encoding is
11500 used.) 11504 used.)
11555 @table @code 11559 @table @code
11556 @item Ibyte 11560 @item Ibyte
11557 The data in a buffer or string is logically made up of Ibyte objects, 11561 The data in a buffer or string is logically made up of Ibyte objects,
11558 where an Ibyte takes up the same amount of space as a char. (It is 11562 where an Ibyte takes up the same amount of space as a char. (It is
11559 declared differently, though, to catch invalid usages.) Strings stored 11563 declared differently, though, to catch invalid usages.) Strings stored
11560 using Ibytes are said to be in "internal format". The important 11564 using Ibytes are said to be in ``internal format''. The important
11561 characteristics of internal format are 11565 characteristics of internal format are
11562 11566
11563 @itemize @minus 11567 @itemize @minus
11564 @item 11568 @item
11565 ASCII characters are represented as a single Ibyte, in the range 0 - 11569 ASCII characters are represented as a single Ibyte, in the range 0 -
11608 11612
11609 This means that Ichar values are upwardly compatible with the standard 11613 This means that Ichar values are upwardly compatible with the standard
11610 8-bit representation of ASCII/ISO-8859-1. 11614 8-bit representation of ASCII/ISO-8859-1.
11611 11615
11612 @item Extbyte 11616 @item Extbyte
11613 Strings that go in or out of Emacs are in "external format", typedef'ed 11617 Strings that go in or out of Emacs are in ``external format'', typedef'ed
11614 as an array of char or a char *. There is more than one external format 11618 as an array of char or a char *. There is more than one external format
11615 (JIS, EUC, etc.) but they all have similar properties. They are modal 11619 (JIS, EUC, etc.) but they all have similar properties. They are modal
11616 encodings, which is to say that the meaning of particular bytes is not 11620 encodings, which is to say that the meaning of particular bytes is not
11617 fixed but depends on what "mode" the string is currently in (e.g. bytes 11621 fixed but depends on what ``mode'' the string is currently in (e.g. bytes
11618 in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or 11622 in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or
11619 as 2-byte Kanji, depending on the current mode). The mode starts out in 11623 as 2-byte Kanji, depending on the current mode). The mode starts out in
11620 ASCII/ISO-8859-1 and is switched using escape sequences -- for example, 11624 ASCII/ISO-8859-1 and is switched using escape sequences -- for example,
11621 in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes 11625 in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes
11622 in the range 0 - 0x7f are interpreted as Kanji characters. 11626 in the range 0 - 0x7f are interpreted as Kanji characters.
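What "modal" means in practice can be sketched as a small state machine. The function below is a hypothetical illustration, not XEmacs code. It assumes a JIS-style stream that recognizes only two escape sequences: 'ESC $ B' as described above, and 'ESC ( B' to return to ASCII (the latter is taken from ISO-2022-JP, not from the text). It counts characters, which cannot be done without tracking the current mode.

```c
#include <stddef.h>

/* Count the characters in a JIS-style byte stream of LEN bytes at S.
   Only two escape sequences are recognized (an assumption made to keep
   the sketch short):
     ESC ( B   switch to ASCII (one byte per character)
     ESC $ B   switch to Kanji (two bytes per character)
   A truncated two-byte character at the end counts as one character. */
size_t
count_modal_chars (const unsigned char *s, size_t len)
{
  int kanji_mode = 0;           /* the stream starts out in ASCII mode */
  size_t i = 0, count = 0;

  while (i < len)
    {
      if (s[i] == 0x1B && i + 2 < len && s[i + 2] == 'B'
          && (s[i + 1] == '$' || s[i + 1] == '('))
        {
          /* Escape sequence: changes the mode, represents no character. */
          kanji_mode = (s[i + 1] == '$');
          i += 3;
        }
      else if (kanji_mode && i + 1 < len)
        {
          /* In Kanji mode a pair of bytes is one character. */
          i += 2;
          count++;
        }
      else
        {
          i += 1;
          count++;
        }
    }
  return count;
}
```

The bytes 0x30 0x21 count as two ASCII characters ("0!") in the initial mode but as a single Kanji after 'ESC $ B', which is exactly the property that makes such encodings awkward to process directly.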
11642 11646
11643 There are three possible ways to specify positions in a buffer. All 11647 There are three possible ways to specify positions in a buffer. All
11644 of these are one-based: the beginning of the buffer is position or 11648 of these are one-based: the beginning of the buffer is position or
11645 index 1, and 0 is not a valid position. 11649 index 1, and 0 is not a valid position.
11646 11650
11647 As a "buffer position" (typedef Charbpos): 11651 As a ``buffer position'' (typedef Charbpos):
11648 11652
11649 This is an index specifying an offset in characters from the 11653 This is an index specifying an offset in characters from the
11650 beginning of the buffer. Note that buffer positions are 11654 beginning of the buffer. Note that buffer positions are
11651 logically @strong{between} characters, not on a character. The 11655 logically @strong{between} characters, not on a character. The
11652 difference between two buffer positions specifies the number of 11656 difference between two buffer positions specifies the number of
11653 characters between those positions. Buffer positions are the 11657 characters between those positions. Buffer positions are the
11654 only kind of position externally visible to the user. 11658 only kind of position externally visible to the user.
11655 11659
11656 As a "byte index" (typedef Bytebpos): 11660 As a ``byte index'' (typedef Bytebpos):
11657 11661
11658 This is an index over the bytes used to represent the characters 11662 This is an index over the bytes used to represent the characters
11659 in the buffer. If there is no Mule support, this is identical 11663 in the buffer. If there is no Mule support, this is identical
11660 to a buffer position, because each character is represented 11664 to a buffer position, because each character is represented
11661 using one byte. However, with Mule support, many characters 11665 using one byte. However, with Mule support, many characters
11662 require two or more bytes for their representation, and so a 11666 require two or more bytes for their representation, and so a
11663 byte index may be greater than the corresponding buffer 11667 byte index may be greater than the corresponding buffer
11664 position. 11668 position.
11665 11669
11666 As a "memory index" (typedef Membpos): 11670 As a ``memory index'' (typedef Membpos):
11667 11671
11668 This is the byte index adjusted for the gap. For positions 11672 This is the byte index adjusted for the gap. For positions
11669 before the gap, this is identical to the byte index. For 11673 before the gap, this is identical to the byte index. For
11670 positions after the gap, this is the byte index plus the gap 11674 positions after the gap, this is the byte index plus the gap
11671 size. There are two possible memory indices for the gap 11675 size. There are two possible memory indices for the gap
11672 position; the memory index at the beginning of the gap should 11676 position; the memory index at the beginning of the gap should
11673 always be used, except in code that deals with manipulating the 11677 always be used, except in code that deals with manipulating the
11674 gap, where both indices may be seen. The address of the 11678 gap, where both indices may be seen. The address of the
11675 character "at" (i.e. following) a particular position can be 11679 character ``at'' (i.e. following) a particular position can be
11676 obtained from the formula 11680 obtained from the formula
11677 11681
11678 buffer_start_address + memory_index(position) - 1 11682 buffer_start_address + memory_index(position) - 1
11679 11683
11680 except in the case of characters at the gap position. 11684 except in the case of characters at the gap position.
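The relation between byte indices and memory indices can be sketched as follows. The structure and function names are invented for this illustration and are not the ones in buffer.h; real code should use the buffer-level macros. The sketch assumes one-based byte indices and a single gap.

```c
/* Hypothetical descriptor for a gap buffer; the field names are invented
   for this sketch and are not the ones used in buffer.h. */
struct gap_buffer
{
  unsigned char *start;   /* buffer_start_address */
  long gap_pos;           /* one-based byte index of the gap position */
  long gap_size;          /* size of the gap in bytes */
};

/* Convert a one-based byte index into a memory index.  At the gap
   position itself this yields the index at the beginning of the gap. */
long
byte_index_to_memory_index (const struct gap_buffer *buf, long byte_index)
{
  return byte_index > buf->gap_pos ? byte_index + buf->gap_size : byte_index;
}

/* Address of the character "at" (i.e. following) BYTE_INDEX.  A character
   at the gap position lives just past the gap, so that one case uses the
   memory index at the end of the gap instead. */
unsigned char *
address_of_char_at (const struct gap_buffer *buf, long byte_index)
{
  long mem = byte_index_to_memory_index (buf, byte_index);
  if (byte_index == buf->gap_pos)
    mem += buf->gap_size;
  return buf->start + mem - 1;
}
```

For example, with the text "abcd" stored as `a b <4-byte gap> c d` and the gap at byte index 3, byte index 4 maps to memory index 8, and the character at byte index 3 is found just past the gap.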
11779 use the buffer-level functions in buffer.h, which automatically know the 11783 use the buffer-level functions in buffer.h, which automatically know the
11780 correct format and handle the gap. 11784 correct format and handle the gap.
11781 11785
11782 Some terminology: 11786 Some terminology:
11783 11787
11784 "itext" appearing in the macros means "internal-format text" -- type 11788 "itext" appearing in the macros means "internal-format text" -- type
11785 @code{Ibyte *}. Operations on such pointers themselves, rather than on the 11789 @code{Ibyte *}. Operations on such pointers themselves, rather than on the
11786 text being pointed to, have "ibyteptr" instead of "itext" in the macro 11790 text being pointed to, have "ibyteptr" instead of "itext" in the macro
11787 name. "ichar" in the macro names means an Ichar -- the representation 11791 name. "ichar" in the macro names means an Ichar -- the representation
11788 of a character as a single integer rather than a series of bytes, as part 11792 of a character as a single integer rather than a series of bytes, as part
11789 of "itext". Many of the macros below are for converting between the 11793 of "itext". Many of the macros below are for converting between the
11988 @item 11992 @item
11989 (c) using the GCC extension (@{ ... @}). 11993 (c) using the GCC extension (@{ ... @}).
11990 @end itemize 11994 @end itemize
11991 11995
11992 Turned out that all of the above had bugs, all caused by GCC (hence the 11996 Turned out that all of the above had bugs, all caused by GCC (hence the
11993 comments about "those GCC wankers" and "ream gcc up the ass"). As for 11997 comments about ``those GCC wankers'' and ``ream gcc up the ass''). As for
11994 (a), some versions of GCC (especially on Intel platforms) had 11998 (a), some versions of GCC (especially on Intel platforms) had
11995 buggy implementations of @code{alloca()} that couldn't handle being called 11999 buggy implementations of @code{alloca()} that couldn't handle being called
11996 inside of a function call -- they just decremented the stack right in the 12000 inside of a function call -- they just decremented the stack right in the
11997 middle of pushing args. Oops, crash with stack trashing, very bad. (b) 12001 middle of pushing args. Oops, crash with stack trashing, very bad. (b)
11998 was an attempt to fix (a), and that led to further GCC crashes, esp. when 12002 was an attempt to fix (a), and that led to further GCC crashes, esp. when
12971 consistency. For example, the new Mule workspace contains Ibyte 12975 consistency. For example, the new Mule workspace contains Ibyte
12972 versions of the stdlib string functions. 12976 versions of the stdlib string functions.
12973 @item Extbyte, UExtbyte 12977 @item Extbyte, UExtbyte
12974 Pointer to text in some external format, which can be defined as all 12978 Pointer to text in some external format, which can be defined as all
12975 formats other than the internal one. The data representing a string 12979 formats other than the internal one. The data representing a string
12976 in "external" format (binary or any external encoding) is logically a 12980 in ``external'' format (binary or any external encoding) is logically a
12977 set of Extbytes. Extbyte is guaranteed to be just a char, so for 12981 set of Extbytes. Extbyte is guaranteed to be just a char, so for
12978 example strlen (Extbyte *) is OK. Extbyte is only a documentation 12982 example strlen (Extbyte *) is OK. Extbyte is only a documentation
12979 device for referring to external text. 12983 device for referring to external text.
12980 @item Ascbyte, UAscbyte 12984 @item Ascbyte, UAscbyte
12981 pure ASCII text, consisting of bytes in a string in entirely US-ASCII 12985 pure ASCII text, consisting of bytes in a string in entirely US-ASCII
13115 13119
13116 @node Mule-izing Code, , An Example of Mule-Aware Code, Coding for Mule 13120 @node Mule-izing Code, , An Example of Mule-Aware Code, Coding for Mule
13117 @subsection Mule-izing Code 13121 @subsection Mule-izing Code
13118 13122
13119 A lot of code is written without Mule in mind, and needs to be made 13123 A lot of code is written without Mule in mind, and needs to be made
13120 Mule-correct or "Mule-ized". There is really no substitute for 13124 Mule-correct or ``Mule-ized''. There is really no substitute for
13121 line-by-line analysis when doing this, but the following checklist can 13125 line-by-line analysis when doing this, but the following checklist can
13122 help: 13126 help:
13123 13127
13124 @itemize @bullet 13128 @itemize @bullet
13125 @item 13129 @item
13333 @item 13337 @item
13334 Look in the CRT sources! They come with VC++. See win32.c. 13338 Look in the CRT sources! They come with VC++. See win32.c.
13335 @end enumerate 13339 @end enumerate
13336 13340
13337 @node Locales, More about code pages, Microsoft Documentation, Microsoft Windows-Related Multilingual Issues 13341 @node Locales, More about code pages, Microsoft Documentation, Microsoft Windows-Related Multilingual Issues
13338 @subsection Locales, code pages, and other concepts of "language" 13342 @subsection Locales, code pages, and other concepts of ``language''
13339 @cindex locales, code pages, and other concepts of "language" 13343 @cindex locales, code pages, and other concepts of ``language''
13340 13344
13341 First, make sure you clearly understand the difference between the C 13345 First, make sure you clearly understand the difference between the C
13342 runtime library (CRT) and the Win32 API! See win32.c. 13346 runtime library (CRT) and the Win32 API! See win32.c.
13343 13347
13344 There are various different ways of representing the vague concept 13348 There are various different ways of representing the vague concept
13345 of "language", and it can be very confusing. So: 13349 of ``language'', and it can be very confusing. So:
13346 13350
13347 @itemize @bullet 13351 @itemize @bullet
13348 @item 13352 @item
13349 The CRT library has the concept of "locale", which is a 13353 The CRT library has the concept of ``locale'', which is a
13350 combination of language and country, and which controls the way 13354 combination of language and country, and which controls the way
13351 currency and dates are displayed, the encoding of data, etc. 13355 currency and dates are displayed, the encoding of data, etc.
13352 13356
13353 @item 13357 @item
13354 XEmacs has the concept of "language environment", more or less 13358 XEmacs has the concept of ``language environment'', more or less
13355 like a locale; although currently in most cases it just refers to 13359 like a locale; although currently in most cases it just refers to
13356 the language, and no sub-language distinctions are 13360 the language, and no sub-language distinctions are
13357 made. (Exceptions are with Chinese, which has different language 13361 made. (Exceptions are with Chinese, which has different language
13358 environments for Taiwan and mainland China, due to the different 13362 environments for Taiwan and mainland China, due to the different
13359 encodings and writing systems.) 13363 encodings and writing systems.)
13361 @item 13365 @item
13362 Windows has a number of different language concepts: 13366 Windows has a number of different language concepts:
13363 13367
13364 @enumerate 13368 @enumerate
13365 @item 13369 @item
13366 There are "languages" and "sublanguages", which correspond to 13370 There are ``languages'' and ``sublanguages'', which correspond to
13367 the languages and countries of the C library -- e.g. LANG_ENGLISH 13371 the languages and countries of the C library -- e.g. LANG_ENGLISH
13368 and SUBLANG_ENGLISH_US. These are identified by 8-bit integers, 13372 and SUBLANG_ENGLISH_US. These are identified by 8-bit integers,
13369 called the "primary language identifier" and "sublanguage 13373 called the ``primary language identifier'' and ``sublanguage
13370 identifier", respectively. These are combined into a 16-bit 13374 identifier'', respectively. These are combined into a 16-bit
13371 integer or "language identifier" by MAKELANGID(). 13375 integer or ``language identifier'' by @code{MAKELANGID()}.
13372 13376
13373 @item 13377 @item
13374 The language identifier in turn is combined with a "sort 13378 The language identifier in turn is combined with a ``sort
13375 identifier" (and optionally a "sort version") to yield a 32-bit 13379 identifier'' (and optionally a ``sort version'') to yield a 32-bit
13376 integer called a "locale identifier" (type LCID), which identifies 13380 integer called a ``locale identifier'' (type LCID), which identifies
13377 locales -- the primary means of distinguishing language/regional 13381 locales -- the primary means of distinguishing language/regional
13378 settings and similar to C library locales. 13382 settings and similar to C library locales.
13379 13383
13380 @item 13384 @item
13381 A "code page" combines the XEmacs concepts of "charset" and "coding 13385 A ``code page'' combines the XEmacs concepts of ``charset'' and ``coding
13382 system". It logically encompasses 13386 system''. It logically encompasses
13383 13387
13384 @itemize @minus 13388 @itemize @minus
13385 @item 13389 @item
13386 a set of supported characters 13390 a set of supported characters
13387 @item 13391 @item
13390 supported 13394 supported
13391 @item 13395 @item
13392 a way of encoding a series of characters into a string of bytes 13396 a way of encoding a series of characters into a string of bytes
13393 @end itemize 13397 @end itemize
13394 13398
13395 Note that the first two properties correspond to an XEmacs "charset" 13399 Note that the first two properties correspond to an XEmacs ``charset''
13396 and the latter an XEmacs "coding system". 13400 and the latter an XEmacs ``coding system''.
13397 13401
13398 Traditional encodings are either simple one-byte encodings, or 13402 Traditional encodings are either simple one-byte encodings, or
13399 combination one-byte/two-byte encodings (aka MBCS encodings, where MBCS 13403 combination one-byte/two-byte encodings (aka MBCS encodings, where MBCS
13400 stands for "Multibyte Character Set") with the following properties: 13404 stands for ``Multibyte Character Set'') with the following properties:
13401 13405
13402 @itemize @minus 13406 @itemize @minus
13403 @item 13407 @item
13404 all characters are encoded as a one-byte or two-byte sequence 13408 all characters are encoded as a one-byte or two-byte sequence
13405 @item 13409 @item
13406 the encoding is stateless (non-modal) 13410 the encoding is stateless (non-modal)
13407 @item 13411 @item
13408 the lower 128 bytes are compatible with ASCII 13412 the lower 128 bytes are compatible with ASCII
13409 @item 13413 @item
13410 in the higher bytes, the value of the first byte ("lead byte") 13414 in the higher bytes, the value of the first byte (``lead byte'')
13411 determines whether a second byte follows 13415 determines whether a second byte follows
13412 @item 13416 @item
13413 the values used for second bytes may overlap those used for first 13417 the values used for second bytes may overlap those used for first
13414 bytes, and (in some encodings) include values in the low half; thus, 13418 bytes, and (in some encodings) include values in the low half; thus,
13415 moving backwards is hard, and pure-ASCII algorithms (e.g. finding the 13419 moving backwards is hard, and pure-ASCII algorithms (e.g. finding the
13427 Every Windows locale has four associated code pages: ANSI (an 13431 Every Windows locale has four associated code pages: ANSI (an
13428 international standard or some Microsoft-created approximation; the 13432 international standard or some Microsoft-created approximation; the
13429 native code page under Windows), OEM (a DOS encoding, still used in the 13433 native code page under Windows), OEM (a DOS encoding, still used in the
13430 FAT file system), Mac (an encoding used on the Macintosh) and EBCDIC (a 13434 FAT file system), Mac (an encoding used on the Macintosh) and EBCDIC (a
13431 non-ASCII-compatible encoding used on IBM mainframes, originally based 13435 non-ASCII-compatible encoding used on IBM mainframes, originally based
13432 on the BCD or "binary-coded decimal" encoding of numbers). All code 13436 on the BCD or ``binary-coded decimal'' encoding of numbers). All code
13433 pages associated with a locale follow (as far as I know) the properties 13437 pages associated with a locale follow (as far as I know) the properties
13434 listed above for traditional code pages. More than one locale can share 13438 listed above for traditional code pages. More than one locale can share
13435 a code page -- e.g. all the Western European languages, including 13439 a code page -- e.g. all the Western European languages, including
13436 English, do. 13440 English, do.
13437 13441
13438 @item 13442 @item
13439 Windows also has an "input locale identifier" (aka "keyboard 13443 Windows also has an ``input locale identifier'' (aka ``keyboard
13440 layout id") or HKL, which is a 32-bit integer composed of the 13444 layout id'') or HKL, which is a 32-bit integer composed of the
13441 16-bit language identifier and a 16-bit "device identifier", which 13445 16-bit language identifier and a 16-bit ``device identifier'', which
13442 originally specified a particular keyboard layout (e.g. the locale 13446 originally specified a particular keyboard layout (e.g. the locale
13443 "US English" can have the QWERTY layout, the Dvorak layout, etc.), 13447 ``US English'' can have the QWERTY layout, the Dvorak layout, etc.),
13444 but has been expanded to include speech-to-text converters and 13448 but has been expanded to include speech-to-text converters and
13445 other non-keyboard ways of inputting text. Note that both the HKL 13449 other non-keyboard ways of inputting text. Note that both the HKL
13446 and LCID share the language identifier in the lower 16 bits, and in 13450 and LCID share the language identifier in the lower 16 bits, and in
13447 both cases a 0 in the upper 16 bits means "default" (sort order or 13451 both cases a 0 in the upper 16 bits means ``default'' (sort order or
13448 device), providing a way to convert between HKL's, LCID's, and 13452 device), providing a way to convert between HKL's, LCID's, and
13449 language identifiers (i.e. language/sublanguage pairs). The 13453 language identifiers (i.e. language/sublanguage pairs). The
13450 default keyboard layout for a language is (as far as I can 13454 default keyboard layout for a language is (as far as I can
13451 determine) established using the Regional Settings control panel 13455 determine) established using the Regional Settings control panel
13452 applet, where you can add input locales as combinations of language 13456 applet, where you can add input locales as combinations of language
13460 13464
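The way these Windows identifiers nest can be sketched with plain integer arithmetic. The functions below are portable stand-ins written for this illustration; they are not the Win32 macros themselves and need no Windows headers. As far as I know, the SDK's `MAKELANGID` places the sublanguage above a 10-bit primary language field, which is what the shift by 10 assumes. The values 0x09 and 0x01 are the SDK values of LANG_ENGLISH and SUBLANG_ENGLISH_US.

```c
#include <stdint.h>

/* Combine a primary language and a sublanguage into a 16-bit language
   identifier (stand-in for MAKELANGID; the 10-bit shift is taken from
   the SDK headers). */
uint16_t
make_langid (unsigned primary, unsigned sublang)
{
  return (uint16_t) ((sublang << 10) | primary);
}

/* Combine a language identifier and a sort identifier into a 32-bit
   locale identifier (stand-in for MAKELCID, without a sort version). */
uint32_t
make_lcid (uint16_t langid, uint16_t sort_id)
{
  return ((uint32_t) sort_id << 16) | langid;
}

/* Both an LCID and an HKL (treated here as a 32-bit integer) carry the
   language identifier in their lower 16 bits. */
uint16_t
langid_of (uint32_t lcid_or_hkl)
{
  return (uint16_t) (lcid_or_hkl & 0xFFFF);
}

/* Nonzero if the upper 16 bits are 0, i.e. "default" sort order for an
   LCID or "default" device for an HKL. */
int
upper_half_is_default (uint32_t lcid_or_hkl)
{
  return (lcid_or_hkl >> 16) == 0;
}
```

With these definitions, US English comes out as language identifier 0x0409, and the LCID with the default sort order has the same numeric value, which is what makes converting between HKLs, LCIDs and language identifiers straightforward.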
13461 @node More about code pages, More about locales, Locales, Microsoft Windows-Related Multilingual Issues 13465 @node More about code pages, More about locales, Locales, Microsoft Windows-Related Multilingual Issues
13462 @subsection More about code pages 13466 @subsection More about code pages
13463 @cindex more about code pages 13467 @cindex more about code pages
13464 13468
13465 Here is what MSDN says about code pages (article "Code Pages"): 13469 Here is what MSDN says about code pages (article ``Code Pages''):
13466 13470
13467 @quotation 13471 @quotation
13468 A code page is a character set, which can include numbers, 13472 A code page is a character set, which can include numbers,
13469 punctuation marks, and other glyphs. Different languages and locales 13473 punctuation marks, and other glyphs. Different languages and locales
13470 may use different code pages. For example, ANSI code page 1252 is 13474 may use different code pages. For example, ANSI code page 1252 is
13502 13506
13503 -- The "C" locale is defined by ANSI to correspond to the locale in 13507 -- The "C" locale is defined by ANSI to correspond to the locale in
13504 which C programs have traditionally executed. The code page for the 13508 which C programs have traditionally executed. The code page for the
13505 "C" locale (code page) corresponds to the ASCII character 13509 "C" locale (code page) corresponds to the ASCII character
13506 set. For example, in the "C" locale, islower returns true for the 13510 set. For example, in the "C" locale, islower returns true for the
13507 values 0x61 ?0x7A only. In another locale, islower may return true 13511 values 0x61 to 0x7A only. In another locale, islower may return true
13508 for these as well as other values, as defined by that locale. 13512 for these as well as other values, as defined by that locale.
13509 13513
13510 Under "Locale-Dependent Routines" we notice the following setlocale 13514 Under ``Locale-Dependent Routines'' we notice the following setlocale
13511 dependencies: 13515 dependencies:
13512 13516
13513 atof, atoi, atol (LC_NUMERIC) 13517 atof, atoi, atol (LC_NUMERIC)
13514 is Routines (LC_CTYPE) 13518 is Routines (LC_CTYPE)
13515 isleadbyte (LC_CTYPE) 13519 isleadbyte (LC_CTYPE)
13538 wcstombs (LC_CTYPE) 13542 wcstombs (LC_CTYPE)
13539 wctomb (LC_CTYPE) 13543 wctomb (LC_CTYPE)
13540 _wtoi/_wtol (LC_NUMERIC) 13544 _wtoi/_wtol (LC_NUMERIC)
13541 @end quotation 13545 @end quotation
13542 13546
13543 NOTE: The above documentation doesn't clearly explain the "locale code 13547 NOTE: The above documentation doesn't clearly explain the ``locale code
13544 page" and "multibyte code page". These are two different values, 13548 page'' and ``multibyte code page''. These are two different values,
13545 maintained respectively in the CRT global variables __lc_codepage and 13549 maintained respectively in the CRT global variables __lc_codepage and
13546 __mbcodepage. Calling e.g. setlocale (LC_ALL, "JAPANESE") sets @strong{ONLY} 13550 __mbcodepage. Calling e.g. setlocale (LC_ALL, "JAPANESE") sets @strong{ONLY}
13547 __lc_codepage to 932 (the code page for Japanese), and leaves 13551 __lc_codepage to 932 (the code page for Japanese), and leaves
13548 __mbcodepage unchanged (usually 1252, i.e. Windows-ANSI). You'd have to 13552 __mbcodepage unchanged (usually 1252, i.e. Windows-ANSI). You'd have to
13549 call _setmbcp() to change __mbcodepage. Figuring out from the 13553 call _setmbcp() to change __mbcodepage. Figuring out from the
13550 documentation which routines use which code page is not so obvious. But: 13554 documentation which routines use which code page is not so obvious. But:
13551 13555
13552 @itemize @bullet 13556 @itemize @bullet
13553 @item 13557 @item
13554 from "Interpretation of Multibyte-Character Sequences" it appears that 13558 from ``Interpretation of Multibyte-Character Sequences'' it appears that
13555 all "multibyte-character routines" use the multibyte code page except for 13559 all ``multibyte-character routines'' use the multibyte code page except for
13556 mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), and wctomb(). 13560 @code{mblen()}, @code{_mbstrlen()}, @code{mbstowcs()}, @code{mbtowc()}, @code{wcstombs()}, and @code{wctomb()}.
13557 13561
13558 @item 13562 @item
13559 from "_setmbcp": "The multibyte code page also affects 13563 from ``_setmbcp'': ``The multibyte code page also affects
13560 multibyte-character processing by the following run-time library 13564 multibyte-character processing by the following run-time library
13561 routines: _exec functions _mktemp _stat _fullpath _spawn functions 13565 routines: _exec functions _mktemp _stat _fullpath _spawn functions
13562 _tempnam _makepath _splitpath tmpnam. In addition, all run-time library 13566 _tempnam _makepath _splitpath tmpnam. In addition, all run-time library
13563 routines that receive multibyte-character argv or envp program arguments 13567 routines that receive multibyte-character argv or envp program arguments
13564 as parameters (such as the _exec and _spawn families) process these 13568 as parameters (such as the _exec and _spawn families) process these
13565 strings according to the multibyte code page. Hence these routines are 13569 strings according to the multibyte code page. Hence these routines are
13566 also affected by a call to _setmbcp that changes the multibyte code 13570 also affected by a call to _setmbcp that changes the multibyte code
13567 page." 13571 page.''
13568 @end itemize 13572 @end itemize
13569 13573
13570 Summary: from looking at the CRT source (which comes with VC++) and 13574 Summary: from looking at the CRT source (which comes with VC++) and
13571 carefully looking through the docs, it appears that: 13575 carefully looking through the docs, it appears that:
13572 13576
13573 @itemize @bullet 13577 @itemize @bullet
13574 @item 13578 @item
13575 the "locale code page" is used by all of the routines listed above 13579 the ``locale code page'' is used by all of the routines listed above
13576 under "Locale-Dependent Routines" (EXCEPT _mbccpy() and _mbclen()), 13580 under ``Locale-Dependent Routines'' (EXCEPT @code{_mbccpy()} and @code{_mbclen()}),
13577 as well as any other place that converts between multibyte and Unicode 13581 as well as any other place that converts between multibyte and Unicode
13578 strings, e.g. the startup code. 13582 strings, e.g. the startup code.
13579 @item 13583 @item
13580 the "multibyte code page" is used in all of the *mb*() routines 13584 the ``multibyte code page'' is used in all of the @code{mb*()} routines
13581 except mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), 13585 except @code{mblen()}, @code{_mbstrlen()}, @code{mbstowcs()}, @code{mbtowc()}, @code{wcstombs()},
13582 and wctomb(); also _exec*(), _spawn*(), _mktemp(), _stat(), _fullpath(), 13586 and @code{wctomb()}; also @code{_exec*()}, @code{_spawn*()}, @code{_mktemp()}, @code{_stat()}, @code{_fullpath()},
13583 _tempnam(), _makepath(), _splitpath(), tmpnam(), and similar functions 13587 @code{_tempnam()}, @code{_makepath()}, @code{_splitpath()}, @code{tmpnam()}, and similar functions
13584 without the leading underscore. 13588 without the leading underscore.
13585 @end itemize 13589 @end itemize
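Why it matters which code page a routine consults can be seen from a small sketch. In a combination one-byte/two-byte encoding the second byte of a character may be an ASCII value, so a byte-wise search gives wrong answers. The code below is a hypothetical illustration, not CRT or XEmacs code. It hard-codes the lead-byte ranges of code page 932 (Shift-JIS), where real code would ask the CRT (e.g. via isleadbyte) and would therefore depend on the code page in effect.

```c
#include <stddef.h>

/* Nonzero if C is a lead byte in code page 932 (Shift-JIS).  The ranges
   are hard-coded for this sketch. */
int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return the offset of the last backslash in the LEN bytes at S that is
   a character in its own right, or -1 if there is none.  The string is
   scanned forward, skipping the second byte of each two-byte character,
   because that byte may itself have the value 0x5C. */
long
mbcs_last_backslash (const unsigned char *s, size_t len)
{
  long last = -1;
  size_t i = 0;

  while (i < len)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        i += 2;                 /* two-byte character: skip both bytes */
      else
        {
          if (s[i] == '\\')
            last = (long) i;
          i++;
        }
    }
  return last;
}
```

For the path consisting of `C:\` followed by the two-byte character 0x95 0x5C and the letter `a`, a byte-wise search for the last backslash stops inside the two-byte character, while the scan above finds the real separator at offset 2.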
13586 13590
13587 @node More about locales, Unicode support under Windows, More about code pages, Microsoft Windows-Related Multilingual Issues 13591 @node More about locales, Unicode support under Windows, More about code pages, Microsoft Windows-Related Multilingual Issues
13588 @subsection More about locales 13592 @subsection More about locales
13591 In addition to the locale defined by the CRT, Windows (i.e. the Win32 API) 13595 In addition to the locale defined by the CRT, Windows (i.e. the Win32 API)
13592 defines various locales: 13596 defines various locales:
13593 13597
13594 @itemize @bullet 13598 @itemize @bullet
13595 @item 13599 @item
13596 The system-default locale is the locale defined under "Language 13600 The system-default locale is the locale defined under ``Language
13597 settings for the system" in the "Regional Options" control panel. This 13601 settings for the system'' in the ``Regional Options'' control panel. This
13598 is NOT user-specific, and changing it requires a reboot (at least under 13602 is NOT user-specific, and changing it requires a reboot (at least under
13599 Windows 2000). The ANSI code page of the system-default locale is 13603 Windows 2000). The ANSI code page of the system-default locale is
13600 returned by GetACP(), and you can specify this code page in calls 13604 returned by @code{GetACP()}, and you can specify this code page in calls
13601 e.g. to MultiByteToWideChar with the constant CP_ACP. 13605 e.g. to MultiByteToWideChar with the constant CP_ACP.
13602 13606
13603 @item 13607 @item
13604 The user-default locale is the locale defined under "Settings for the 13608 The user-default locale is the locale defined under ``Settings for the
13605 current user" in the "Regional Options" control panel. 13609 current user'' in the ``Regional Options'' control panel.
13606 13610
13607 @item 13611 @item
13608 There is a thread-local locale set by SetThreadLocale. #### What is this 13612 There is a thread-local locale set by SetThreadLocale. #### What is this
13609 used for? 13613 used for?
13610 @end itemize 13614 @end itemize
13611 13615
13612 The Win32 API has a bunch of multibyte functions -- all of those that 13616 The Win32 API has a bunch of multibyte functions -- all of those that
13613 end with ...A(), and on which we spend so much effort in 13617 end with ...@code{A()}, and on which we spend so much effort in
13614 intl-encap-win32.c. These appear to ALWAYS use the ANSI code page of 13618 intl-encap-win32.c. These appear to ALWAYS use the ANSI code page of
13615 the system-default locale (GetACP(), CP_ACP). Note that this applies 13619 the system-default locale (@code{GetACP()}, CP_ACP). Note that this applies
13616 also, for example, to the encoding of filenames in all file-handling 13620 also, for example, to the encoding of filenames in all file-handling
13617 routines, including the CRT ones such as open(), because they pass their 13621 routines, including the CRT ones such as @code{open()}, because they pass their
13618 args unchanged to the Win32 API. 13622 args unchanged to the Win32 API.
13619 13623
13620 @node Unicode support under Windows, The golden rules of writing Unicode-safe code, More about locales, Microsoft Windows-Related Multilingual Issues 13624 @node Unicode support under Windows, The golden rules of writing Unicode-safe code, More about locales, Microsoft Windows-Related Multilingual Issues
13621 @subsection Unicode support under Windows 13625 @subsection Unicode support under Windows
13622 @cindex unicode support under windows 13626 @cindex unicode support under windows
13630 table to convert the characters of that code page to and from Unicode, and 13634 table to convert the characters of that code page to and from Unicode, and
13631 the Win32 API itself probably (perhaps always) uses Unicode internally. 13635 the Win32 API itself probably (perhaps always) uses Unicode internally.
13632 13636
13633 Under Windows there are two different versions of all library routines that 13637 Under Windows there are two different versions of all library routines that
13634 accept or return text, those that handle Unicode text and those handling 13638 accept or return text, those that handle Unicode text and those handling
13635 "multibyte" text, i.e. variable-width ASCII-compatible text in some 13639 ``multibyte'' text, i.e. variable-width ASCII-compatible text in some
13636 national format such as EUC or Shift-JIS. Because Windows 95 basically 13640 national format such as EUC or Shift-JIS. Because Windows 95 basically
13637 doesn't support Unicode but Windows NT does, and Microsoft doesn't provide 13641 doesn't support Unicode but Windows NT does, and Microsoft doesn't provide
13638 any way of writing a single binary that will work on both systems and still 13642 any way of writing a single binary that will work on both systems and still
13639 use Unicode when it's available (although see below, Microsoft Layer for 13643 use Unicode when it's available (although see below, Microsoft Layer for
13640 Unicode), we need to provide a way of run-time conditionalizing so you 13644 Unicode), we need to provide a way of run-time conditionalizing so you
13641 could have one binary for both systems. "Unicode-splitting" refers to 13645 could have one binary for both systems. ``Unicode-splitting'' refers to
13642 writing code that will handle this properly. This means using 13646 writing code that will handle this properly. This means using
13643 Qmswindows_tstr as the external conversion format, calling the appropriate 13647 Qmswindows_tstr as the external conversion format, calling the appropriate
13644 qxe...() Unicode-split version of library functions, and doing other things 13648 qxe...() Unicode-split version of library functions, and doing other things
13645 in certain cases, e.g. when a qxe() function is not present. 13649 in certain cases, e.g. when a @code{qxe()} function is not present.
13646 13650
13647 Unicode support also requires that the various Windows APIs be 13651 Unicode support also requires that the various Windows APIs be
13648 "Unicode-encapsulated", so that they automatically call the ANSI or 13652 ``Unicode-encapsulated'', so that they automatically call the ANSI or
13649 Unicode version of the API call appropriately and handle the size 13653 Unicode version of the API call appropriately and handle the size
13650 differences in structures. What this means is: 13654 differences in structures. What this means is:
13651 13655
13652 @itemize @bullet 13656 @itemize @bullet
13653 @item 13657 @item
13654 first, note that Windows already provides a sort of encapsulation 13658 first, note that Windows already provides a sort of encapsulation
13655 of all APIs that deal with text. All such APIs are underlyingly 13659 of all APIs that deal with text. All such APIs are underlyingly
13656 provided in two versions, with an A or W suffix (ANSI or "wide" 13660 provided in two versions, with an A or W suffix (ANSI or ``wide''
13657 i.e. Unicode), and the compile-time constant UNICODE controls which is 13661 i.e. Unicode), and the compile-time constant UNICODE controls which is
13658 selected by the unsuffixed API. Same thing happens with structures, and 13662 selected by the unsuffixed API. Same thing happens with structures, and
13659 also with types, where the generic types have names beginning with T -- 13663 also with types, where the generic types have names beginning with T --
13660 TCHAR, LPTSTR, etc. Unfortunately, this is compile-time only, not 13664 TCHAR, LPTSTR, etc. Unfortunately, this is compile-time only, not
13661 run-time, so not sufficient. (Creating the necessary run-time encoding 13665 run-time, so not sufficient. (Creating the necessary run-time encoding
13670 such an API available internally.) 13674 such an API available internally.)
13671 13675
13672 @item 13676 @item
13673 what we do is provide an encapsulation of each standard Windows API call 13677 what we do is provide an encapsulation of each standard Windows API call
13674 that is split into A and W versions. Current theory is to avoid all 13678 that is split into A and W versions. Current theory is to avoid all
13675 preprocessor games; so we name the function with a prefix -- "qxe" 13679 preprocessor games; so we name the function with a prefix -- ``qxe''
13676 currently -- and require callers to use the prefixed name. Callers need 13680 currently -- and require callers to use the prefixed name. Callers need
13677 to explicitly use the W version of all structures, and convert text 13681 to explicitly use the W version of all structures, and convert text
13678 themselves using Qmswindows_tstr. The qxe encapsulated version will 13682 themselves using Qmswindows_tstr. The qxe encapsulated version will
13679 automatically call the appropriate A or W version depending on whether 13683 automatically call the appropriate A or W version depending on whether
13680 we're running on 9x or NT (you can force use of the A calls on NT, 13684 we're running on 9x or NT (you can force use of the A calls on NT,
13730 purpose, to make the code easier to follow for someone who's not familiar 13734 purpose, to make the code easier to follow for someone who's not familiar
13731 with it. Until our library is really complete and bug-free, we should 13735 with it. Until our library is really complete and bug-free, we should
13732 think twice before doing this. 13736 think twice before doing this.
13733 13737
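The run-time choice between the A and W versions described above can be sketched in portable C. This is only an illustration of the dispatch idea, not the actual code in intl-encap-win32.c: the names @code{example_length_a}, @code{example_length_w}, @code{example_use_unicode} and @code{qxe_example_length} are hypothetical, and the two "API" functions are stand-ins that merely count characters.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for a Windows API pair FooA()/FooW().
   Each just reports the length of its text argument in characters. */
static size_t
example_length_a (const char *text)
{
  return strlen (text);
}

static size_t
example_length_w (const wchar_t *text)
{
  return wcslen (text);
}

/* Hypothetical run-time flag: nonzero when the W entry points should be
   used (the NT case), zero when the A entry points should be used (the
   9x case, or when the A calls are forced on NT). */
static int example_use_unicode = 1;

/* A "qxe"-style wrapper.  Callers always use this one prefixed name and
   pass text that has already been converted to the format matching the
   flag; the wrapper selects the A or W version at run time rather than
   at compile time. */
static size_t
qxe_example_length (const void *tstr)
{
  if (example_use_unicode)
    return example_length_w ((const wchar_t *) tstr);
  else
    return example_length_a ((const char *) tstr);
}
```

The point of the sketch is that the decision is an ordinary run-time branch, which is what the compile-time UNICODE constant cannot provide.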
13734 According to Microsoft documentation, only the following functions are 13738 According to Microsoft documentation, only the following functions are
13735 provided under Windows 9x to support Unicode (see MSDN page "Windows 13739 provided under Windows 9x to support Unicode (see MSDN page ``Windows
13736 95/98/Me General Limitations"): 13740 95/98/Me General Limitations''):
13737 13741
13738 EnumResourceLanguagesW 13742 EnumResourceLanguagesW
13739 EnumResourceNamesW 13743 EnumResourceNamesW
13740 EnumResourceTypesW 13744 EnumResourceTypesW
13741 ExtTextOutW 13745 ExtTextOutW
13752 MessageBoxExW 13756 MessageBoxExW
13753 MultiByteToWideChar 13757 MultiByteToWideChar
13754 TextOutW 13758 TextOutW
13755 WideCharToMultiByte 13759 WideCharToMultiByte
13756 13760
13757 also maybe GetTextExtentExPoint? (KB Q125671 "Unicode Functions Supported 13761 also maybe GetTextExtentExPoint? (KB Q125671 ``Unicode Functions Supported
13758 by Windows 95") 13762 by Windows 95'')
13759 13763
13760 Q210341 says this in addition: 13764 Q210341 says this in addition:
13761 13765
13762 @quotation 13766 @quotation
13763 SUMMARY: 13767 SUMMARY:
13778 range beyond the 256 limitation of a one-byte representation. 13782 range beyond the 256 limitation of a one-byte representation.
13779 13783
13780 The Unicode standard offers application developers an opportunity to 13784 The Unicode standard offers application developers an opportunity to
13781 work with text without the limitations of character set based 13785 work with text without the limitations of character set based
13782 systems. For more information on the Unicode standard see the 13786 systems. For more information on the Unicode standard see the
13783 "References" section of this article. Windows NT is a fully Unicode 13787 ``References'' section of this article. Windows NT is a fully Unicode
13784 capable operating system so it may be desirable to write software that 13788 capable operating system so it may be desirable to write software that
13785 supports Unicode on Windows 95. 13789 supports Unicode on Windows 95.
13786 13790
13787 Even though Windows 95 and Windows 98 are not Unicode based, they do 13791 Even though Windows 95 and Windows 98 are not Unicode based, they do
13788 provide some limited Unicode functionality. Drawing of Unicode text is 13792 provide some limited Unicode functionality. Drawing of Unicode text is
13861 @itemize @bullet 13865 @itemize @bullet
13862 @item 13866 @item
13863 wmain() is completely supported, and appropriate Unicode-formatted argv 13867 wmain() is completely supported, and appropriate Unicode-formatted argv
13864 and envp will always be passed. 13868 and envp will always be passed.
13865 @item 13869 @item
13866 Likewise, wWinMain() is completely supported. (NOTE: The docs are not at 13870 Likewise, @code{wWinMain()} is completely supported. (NOTE: The docs are not at
13867 all clear on how these various entry points interact, and imply that 13871 all clear on how these various entry points interact, and imply that
13868 a windows-subsystem program "must" use WinMain(), while a console- 13872 a windows-subsystem program ``must'' use @code{WinMain()}, while a console-
13869 subsystem program "must" use main(), and a program compiled with UNICODE 13873 subsystem program ``must'' use @code{main()}, and a program compiled with UNICODE
13870 (which we don't, see above) "must" use the w*() versions, while a program 13874 (which we don't, see above) ``must'' use the @code{w*()} versions, while a program
13871 not compiled this way "must" use the plain versions. In fact it appears 13875 not compiled this way ``must'' use the plain versions. In fact it appears
13872 that the CRT provides four different compiler entry points, namely 13876 that the CRT provides four different compiler entry points, namely
13873 w?(main|WinMain)CRTStartup, and we simply choose the one we like using 13877 w?(main|WinMain)CRTStartup, and we simply choose the one we like using
13874 the appropriate link flag.) 13878 the appropriate link flag.)
13875 @item 13879 @item
13876 _wenviron, _wputenv 13880 _wenviron, _wputenv
17948 boxes are not explicitly cleared and may contain junk. 17952 boxes are not explicitly cleared and may contain junk.
17949 17953
17950 @node The Frame, The Non-Client Area, Intro to Window and Frame Geometry, Window and Frame Geometry 17954 @node The Frame, The Non-Client Area, Intro to Window and Frame Geometry, Window and Frame Geometry
17951 @section The Frame 17955 @section The Frame
17952 17956
17953 The "top-level window area" is the entire area of a top-level window (or 17957 The ``top-level window area'' is the entire area of a top-level window (or
17954 "frame"). The "client area" (a term from MS Windows) is the area of a 17958 ``frame''). The ``client area'' (a term from MS Windows) is the area of a
17955 top-level window that XEmacs draws into and manages with redisplay. 17959 top-level window that XEmacs draws into and manages with redisplay.
17956 This includes the toolbar, scrollbars, gutters, dividers, text area, 17960 This includes the toolbar, scrollbars, gutters, dividers, text area,
17957 modeline and minibuffer. It does not include the menubar, title or 17961 modeline and minibuffer. It does not include the menubar, title or
17958 outer borders. The "non-client area" is the area of a top-level window 17962 outer borders. The ``non-client area'' is the area of a top-level window
17959 outside of the client area and includes the menubar, title and outer 17963 outside of the client area and includes the menubar, title and outer
17960 borders. Internally, all frame coordinates are relative to the client 17964 borders. Internally, all frame coordinates are relative to the client
17961 area. 17965 area.
17962 17966
17963 17967
17970 @item 17974 @item
17971 The outer layer is the window-manager decorations: The title and 17975 The outer layer is the window-manager decorations: The title and
17972 borders. These are controlled by the window manager, a separate process 17976 borders. These are controlled by the window manager, a separate process
17973 that controls the desktop, the location of icons, etc. When a process 17977 that controls the desktop, the location of icons, etc. When a process
17974 tries to create a window, the window manager intercepts this action and 17978 tries to create a window, the window manager intercepts this action and
17975 "reparents" the window, placing another window around it which contains 17979 ``reparents'' the window, placing another window around it which contains
17976 the window decorations, including the title bar, outer borders used for 17980 the window decorations, including the title bar, outer borders used for
17977 resizing, etc. The window manager also implements any actions involving 17981 resizing, etc. The window manager also implements any actions involving
17978 the decorations, such as the ability to resize a window by dragging its 17982 the decorations, such as the ability to resize a window by dragging its
17979 borders, move a window by dragging its title bar, etc. If there is no 17983 borders, move a window by dragging its title bar, etc. If there is no
17980 window manager or you kill it, windows will have no decorations (and 17984 window manager or you kill it, windows will have no decorations (and
17981 will lose them if they previously had any) and you will not be able to 17985 will lose them if they previously had any) and you will not be able to
17982 move or resize them. 17986 move or resize them.
17983 17987
17984 @item 17988 @item
17985 Inside of the window-manager decorations is the "shell", which is 17989 Inside of the window-manager decorations is the ``shell'', which is
17986 managed by the toolkit and widget libraries your program is linked with. 17990 managed by the toolkit and widget libraries your program is linked with.
17987 The code in @file{*-x.c} uses the Xt toolkit and various possible widget 17991 The code in @file{*-x.c} uses the Xt toolkit and various possible widget
17988 libraries built on top of Xt, such as Motif, Athena, the "Lucid" 17992 libraries built on top of Xt, such as Motif, Athena, the ``Lucid''
17989 widgets, etc. Another possibility is GTK (@file{*-gtk.c}), which implements 17993 widgets, etc. Another possibility is GTK (@file{*-gtk.c}), which implements
17990 both the toolkit and widgets. Under Xt, the "shell" window is an 17994 both the toolkit and widgets. Under Xt, the ``shell'' window is an
17991 EmacsShell widget, containing an EmacsManager widget of the same size, 17995 EmacsShell widget, containing an EmacsManager widget of the same size,
17992 which in turn contains a menubar widget and an EmacsFrame widget, inside 17996 which in turn contains a menubar widget and an EmacsFrame widget, inside
17993 of which is the client area. (The division into EmacsShell and 17997 of which is the client area. (The division into EmacsShell and
17994 EmacsManager is due to the complex and screwy geometry-management system 17998 EmacsManager is due to the complex and screwy geometry-management system
17995 in Xt [and X more generally]. The EmacsShell handles negotiation with 17999 in Xt [and X more generally]. The EmacsShell handles negotiation with
18001 18005
18002 Under Windows, the non-client area is managed by the window system. 18006 Under Windows, the non-client area is managed by the window system.
18003 There is no division such as under X. Part of the window-system API 18007 There is no division such as under X. Part of the window-system API
18004 (@file{USER.DLL}) of Win32 includes functions to control the menubars, title, 18008 (@file{USER.DLL}) of Win32 includes functions to control the menubars, title,
18005 etc. and implements the move and resize behavior. There @strong{is} an 18009 etc. and implements the move and resize behavior. There @strong{is} an
18006 equivalent of the window manager, called the "shell", but it manages 18010 equivalent of the window manager, called the ``shell'', but it manages
18007 only the desktop, not the windows themselves. The normal shell under 18011 only the desktop, not the windows themselves. The normal shell under
18008 Windows is @file{EXPLORER.EXE}; if you kill this, you will lose the bar 18012 Windows is @file{EXPLORER.EXE}; if you kill this, you will lose the bar
18009 containing the "Start" menu and tray and such, but the windows 18013 containing the ``Start'' menu and tray and such, but the windows
18010 themselves will not be affected or lose their decorations. 18014 themselves will not be affected or lose their decorations.
18011 18015
18012 18016
18013 @node The Client Area, The Paned Area, The Non-Client Area, Window and Frame Geometry 18017 @node The Client Area, The Paned Area, The Non-Client Area, Window and Frame Geometry
18014 @section The Client Area 18018 @section The Client Area
18015 18019
18016 Inside of the client area are the toolbars, the gutters (where the buffer 18020 Inside of the client area are the toolbars, the gutters (where the buffer
18017 tabs are displayed), the minibuffer, the internal border width, and one 18021 tabs are displayed), the minibuffer, the internal border width, and one
18018 or more non-overlapping "windows" (this is old Emacs terminology, from 18022 or more non-overlapping ``windows'' (this is old Emacs terminology, from
18019 before the time when frames existed at all; the standard terminology for 18023 before the time when frames existed at all; the standard terminology for
18020 this would be "pane"). Each window can contain a modeline, horizontal 18024 this would be ``pane''). Each window can contain a modeline, horizontal
18021 and/or vertical scrollbars, and (for non-rightmost windows) a vertical 18025 and/or vertical scrollbars, and (for non-rightmost windows) a vertical
18022 divider, surrounding a text area. 18026 divider, surrounding a text area.
18023 18027
18024 The dimensions of the toolbars and gutters are determined by the formula 18028 The dimensions of the toolbars and gutters are determined by the formula
18025 (THICKNESS + 2 * BORDER-THICKNESS), where "thickness" is a cover term 18029 (THICKNESS + 2 * BORDER-THICKNESS), where ``thickness'' is a cover term
18026 for height or width, as appropriate. The height and width come from 18030 for height or width, as appropriate. The height and width come from
18027 @code{default-toolbar-height} and @code{default-toolbar-width} and the specific 18031 @code{default-toolbar-height} and @code{default-toolbar-width} and the specific
18028 versions of these (@code{top-toolbar-height}, @code{left-toolbar-width}, etc.). 18032 versions of these (@code{top-toolbar-height}, @code{left-toolbar-width}, etc.).
18029 The border thickness comes from @code{default-toolbar-border-height} and 18033 The border thickness comes from @code{default-toolbar-border-height} and
18030 @code{default-toolbar-border-width}, and the specific versions of these. The 18034 @code{default-toolbar-border-width}, and the specific versions of these. The
18045 18049
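As a small illustration of the toolbar and gutter formula given earlier in this section, (THICKNESS + 2 * BORDER-THICKNESS), here is a minimal C sketch. The function name @code{example_toolbar_extent} is hypothetical and does not correspond to any function in the XEmacs sources; in the real code the two inputs would come from the specifiers named above.

```c
/* Illustrative only: the total extent of a toolbar or gutter along its
   short axis.  THICKNESS is the height (for top/bottom toolbars) or the
   width (for left/right toolbars); BORDER_THICKNESS is counted twice,
   once for each side. */
static int
example_toolbar_extent (int thickness, int border_thickness)
{
  return thickness + 2 * border_thickness;
}
```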
18046 18050
18047 @node The Paned Area, Text Areas, The Client Area, Window and Frame Geometry 18051 @node The Paned Area, Text Areas, The Client Area, Window and Frame Geometry
18048 @section The Paned Area 18052 @section The Paned Area
18049 18053
18050 The area occupied by the "windows" is called the paned area. 18054 The area occupied by the ``windows'' is called the paned area.
18051 Unfortunately, because of the presence of the gutter @strong{between} the 18055 Unfortunately, because of the presence of the gutter @strong{between} the
18052 minibuffer and other windows, the bottom of the paned area is not 18056 minibuffer and other windows, the bottom of the paned area is not
18053 well-defined -- does it include the minibuffer (in which case it also 18057 well-defined -- does it include the minibuffer (in which case it also
18054 includes the bottom gutter, but none others) or does it not include 18058 includes the bottom gutter, but none others) or does it not include
18055 the minibuffer? (In which case not all windows are included.) It would 18059 the minibuffer? (In which case not all windows are included.) It would
18080 @code{horizontal-scrollbar-visible-p}, @code{vertical-scrollbar-visible-p}, 18084 @code{horizontal-scrollbar-visible-p}, @code{vertical-scrollbar-visible-p},
18081 @code{vertical-divider-always-visible-p}, etc. 18085 @code{vertical-divider-always-visible-p}, etc.
18082 18086
18083 In addition, it is possible to set margins in the text area using the 18087 In addition, it is possible to set margins in the text area using the
18084 specifiers @code{left-margin-width} and @code{right-margin-width}. When this is 18088 specifiers @code{left-margin-width} and @code{right-margin-width}. When this is
18085 done, only the "inner text area" (the area inside of the margins) will 18089 done, only the ``inner text area'' (the area inside of the margins) will
18086 be used for normal display of text; the margins will be used for glyphs 18090 be used for normal display of text; the margins will be used for glyphs
18087 with a layout policy of @code{outside-margin} (as set on an extent containing 18091 with a layout policy of @code{outside-margin} (as set on an extent containing
18088 the glyph by @code{set-extent-begin-glyph-layout} or 18092 the glyph by @code{set-extent-begin-glyph-layout} or
18089 @code{set-extent-end-glyph-layout}). However, the calculation of the text 18093 @code{set-extent-end-glyph-layout}). However, the calculation of the text
18090 area size (e.g. in the function @code{window-text-area-width}) includes the 18094 area size (e.g. in the function @code{window-text-area-width}) includes the
18091 margins. Which margin is used depends on whether a glyph has been set 18095 margins. Which margin is used depends on whether a glyph has been set
18092 as the begin-glyph or end-glyph of an extent (@code{set-extent-begin-glyph} 18096 as the begin-glyph or end-glyph of an extent (@code{set-extent-begin-glyph}
18093 etc.), using the left and right margins, respectively. 18097 etc.), using the left and right margins, respectively.
18094 18098
18095 Technically, the margins outside of the inner text area are known as the 18099 Technically, the margins outside of the inner text area are known as the
18096 "outside margins". The "inside margins" are in the inner text area and 18100 ``outside margins''. The ``inside margins'' are in the inner text area and
18097 constitute the whitespace between the outside margins and the first or 18101 constitute the whitespace between the outside margins and the first or
18098 last non-whitespace character in a line; their width can vary from line 18102 last non-whitespace character in a line; their width can vary from line
18099 to line. Glyphs will be placed in the inside margin if their layout 18103 to line. Glyphs will be placed in the inside margin if their layout
18100 policy is @code{inside-margin} or @code{whitespace}, with @code{whitespace} glyphs on 18104 policy is @code{inside-margin} or @code{whitespace}, with @code{whitespace} glyphs on
18101 the inside and @code{inside-margin} glyphs on the outside. Inside-margin 18105 the inside and @code{inside-margin} glyphs on the outside. Inside-margin
18106 18110
18107 18111
18108 @node The Displayable Area, Which Functions Use Which?, Text Areas, Window and Frame Geometry 18112 @node The Displayable Area, Which Functions Use Which?, Text Areas, Window and Frame Geometry
18109 @section The Displayable Area 18113 @section The Displayable Area
18110 18114
18111 The "displayable area" is not so much an actual area as a convenient 18115 The ``displayable area'' is not so much an actual area as a convenient
18112 fiction. It is the area used to convert between pixel and character 18116 fiction. It is the area used to convert between pixel and character
18113 dimensions for frames. The character dimensions for a frame (e.g. as 18117 dimensions for frames. The character dimensions for a frame (e.g. as
18114 returned by @code{frame-width} and @code{frame-height} and set by 18118 returned by @code{frame-width} and @code{frame-height} and set by
18115 @code{set-frame-width} and @code{set-frame-height}) are determined from the 18119 @code{set-frame-width} and @code{set-frame-height}) are determined from the
18116 displayable area by dividing by the pixel size of the default font as 18120 displayable area by dividing by the pixel size of the default font as
18117 instantiated in the frame. (For proportional fonts, the "average" width 18121 instantiated in the frame. (For proportional fonts, the ``average'' width
18118 is used. Under Windows, this is a built-in property of the fonts. 18122 is used. Under Windows, this is a built-in property of the fonts.
18119 Under X, this is based on the width of the lowercase 'n', or if this is 18123 Under X, this is based on the width of the lowercase 'n', or if this is
18120 zero then the width of the default character. [We prefer 'n' to the 18124 zero then the width of the default character. [We prefer 'n' to the
18121 specified default character because many X fonts have a default 18125 specified default character because many X fonts have a default
18122 character with a zero or otherwise non-representative width.]) 18126 character with a zero or otherwise non-representative width.])
18123 18127
18124 The displayable area is essentially the "theoretical" gutter area of the 18128 The displayable area is essentially the ``theoretical'' gutter area of the
18125 frame, excluding the rightmost and bottom-most scrollbars. That is, it 18129 frame, excluding the rightmost and bottom-most scrollbars. That is, it
18126 starts from the client (or "total") area and then excludes the 18130 starts from the client (or ``total'') area and then excludes the
18127 "theoretical" toolbars and bottom-most/rightmost scrollbars, and the 18131 ``theoretical'' toolbars and bottom-most/rightmost scrollbars, and the
18128 internal border width. In this context, "theoretical" means that all 18132 internal border width. In this context, ``theoretical'' means that all
18129 calculations are based on frame-level values for toolbar and scrollbar 18133 calculations are based on frame-level values for toolbar and scrollbar
18130 thicknesses. Because these thicknesses are controlled by specifiers, 18134 thicknesses. Because these thicknesses are controlled by specifiers,
18131 and specifiers can have window-specific and buffer-specific values, 18135 and specifiers can have window-specific and buffer-specific values,
18132 these calculations may or may not reflect the actual size of the paned 18136 these calculations may or may not reflect the actual size of the paned
18133 area or of the scrollbars when any particular window is selected. Note 18137 area or of the scrollbars when any particular window is selected. Note
18134 also that the "displayable area" may not even be contiguous! In 18138 also that the ``displayable area'' may not even be contiguous! In
18135 particular, the gutters are included, but the bottom-most and rightmost 18139 particular, the gutters are included, but the bottom-most and rightmost
18136 scrollbars are excluded even though they are inside of the gutters. 18140 scrollbars are excluded even though they are inside of the gutters.
18137 Furthermore, if the frame-level value of the horizontal scrollbar height 18141 Furthermore, if the frame-level value of the horizontal scrollbar height
18138 is non-zero, then the displayable area includes the paned area above and 18142 is non-zero, then the displayable area includes the paned area above and
18139 below the bottom horizontal scrollbar (i.e. the modeline and minibuffer) 18143 below the bottom horizontal scrollbar (i.e. the modeline and minibuffer)
18148 width before dividing by the default-font width, and then adding 1 to 18152 width before dividing by the default-font width, and then adding 1 to
18149 the result.) (The ultimate motivation for this kludge as well as the 18153 the result.) (The ultimate motivation for this kludge as well as the
18150 subtraction of the scrollbars, but not the minibuffer or bottom-most 18154 subtraction of the scrollbars, but not the minibuffer or bottom-most
18151 modeline, is to maintain compatibility with TTYs.) 18155 modeline, is to maintain compatibility with TTYs.)
18152 18156
18153 Despite all these concerns and kludges, however, the "displayable area" 18157 Despite all these concerns and kludges, however, the ``displayable area''
18154 concept works well in practice and mostly ensures that by default the 18158 concept works well in practice and mostly ensures that by default the
18155 frame will actually fit 79 characters + continuation/truncation glyph. 18159 frame will actually fit 79 characters + continuation/truncation glyph.
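
The pixel-to-character conversion described in this section can be sketched roughly as follows, for the horizontal direction only. This is a simplified illustration under stated assumptions, not the actual XEmacs implementation: the structure and function names are hypothetical, the internal border is assumed to be subtracted once per side, and the additional adjustment described above as a kludge is deliberately left out.

```c
/* Hypothetical frame-level ("theoretical") metrics, all in pixels. */
struct example_frame_metrics
{
  int client_width;              /* width of the client area */
  int left_toolbar_width;        /* frame-level toolbar values */
  int right_toolbar_width;
  int vertical_scrollbar_width;  /* the rightmost scrollbar only */
  int internal_border_width;     /* assumed to apply on both sides */
  int default_font_width;        /* "average" width for proportional fonts */
};

/* Start from the client area, exclude the theoretical toolbars, the
   rightmost scrollbar and the internal border, then divide by the pixel
   width of the default font to get a width in characters. */
static int
example_frame_char_width (const struct example_frame_metrics *m)
{
  int displayable = m->client_width
    - m->left_toolbar_width
    - m->right_toolbar_width
    - m->vertical_scrollbar_width
    - 2 * m->internal_border_width;

  /* Guard against a degenerate font or a frame too small to display. */
  if (m->default_font_width <= 0 || displayable < 0)
    return 0;

  return displayable / m->default_font_width;
}
```

Note that the gutters do not appear in the subtraction: as described above, they are included in the displayable area.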
18156 18160
18157 18161
18158 @node Which Functions Use Which?, , The Displayable Area, Window and Frame Geometry 18162 @node Which Functions Use Which?, , The Displayable Area, Window and Frame Geometry
19797 @section Event Queues 19801 @section Event Queues
19798 @cindex event queues 19802 @cindex event queues
19799 @cindex queues, event 19803 @cindex queues, event
19800 19804
19801 There are two event queues here -- the command event queue (#### which 19805 There are two event queues here -- the command event queue (#### which
19802 should be called "deferred event queue" and is in my glyph ws) and the 19806 should be called ``deferred event queue'' and is in my glyph ws) and the
19803 dispatch event queue. (MS Windows actually has an extra dispatch queue 19807 dispatch event queue. (MS Windows actually has an extra dispatch queue
19804 for non-user events and uses the generic one only for user events. This 19808 for non-user events and uses the generic one only for user events. This
19805 is because user and non-user events in Windows come through the same 19809 is because user and non-user events in Windows come through the same
19806 place -- the window procedure -- but under X, it's possible to 19810 place -- the window procedure -- but under X, it's possible to
19807 selectively process events such that we take all the user events before 19811 selectively process events such that we take all the user events before
19902 19906
19903 @item handle_magic_event_cb 19907 @item handle_magic_event_cb
19904 XEmacs calls this with an event structure which contains window-system 19908 XEmacs calls this with an event structure which contains window-system
19905 dependent information that XEmacs doesn't need to know about, but which 19909 dependent information that XEmacs doesn't need to know about, but which
19906 must happen in order. If the @code{next_event_cb} never returns an 19910 must happen in order. If the @code{next_event_cb} never returns an
19907 event of type "magic", this will never be used. 19911 event of type ``magic'', this will never be used.
19908 19912
19909 @item format_magic_event_cb 19913 @item format_magic_event_cb
19910 Called with a magic event; print a representation of the innards of the 19914 Called with a magic event; print a representation of the innards of the
19911 event to @var{PSTREAM}. 19915 event to @var{PSTREAM}.
19912 19916
19934 @item select_process_cb 19938 @item select_process_cb
19935 @item unselect_process_cb 19939 @item unselect_process_cb
19936 These callbacks tell the underlying implementation to add or remove a 19940 These callbacks tell the underlying implementation to add or remove a
19937 file descriptor from the list of fds which are polled for 19941 file descriptor from the list of fds which are polled for
19938 inferior-process input. When input becomes available on the given 19942 inferior-process input. When input becomes available on the given
19939 process connection, an event of type "process" should be generated. 19943 process connection, an event of type ``process'' should be generated.
19940 19944
19941 @item select_console_cb 19945 @item select_console_cb
19942 @item unselect_console_cb 19946 @item unselect_console_cb
19943 These callbacks tell the underlying implementation to add or remove a 19947 These callbacks tell the underlying implementation to add or remove a
19944 console from the list of consoles which are polled for user-input. 19948 console from the list of consoles which are polled for user-input.
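
As a rough illustration of the process callbacks above, here is a minimal sketch that keeps the polled descriptors in a POSIX @code{fd_set}. All of the @code{sketch_} names are hypothetical, and the sketch takes raw file descriptors for simplicity; the signatures of the real callbacks are not shown here, and a real implementation would generate a ``process'' event rather than just return the ready descriptor.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical state: the set of fds polled for inferior-process input. */
static fd_set sketch_polled_fds;
static int sketch_highest_fd = -1;

static void
sketch_init (void)
{
  FD_ZERO (&sketch_polled_fds);
  sketch_highest_fd = -1;
}

/* Rough analogue of select_process_cb: start polling FD. */
static void
sketch_select_process (int fd)
{
  if (fd < 0 || fd >= FD_SETSIZE)
    return;
  FD_SET (fd, &sketch_polled_fds);
  if (fd > sketch_highest_fd)
    sketch_highest_fd = fd;
}

/* Rough analogue of unselect_process_cb: stop polling FD. */
static void
sketch_unselect_process (int fd)
{
  if (fd < 0 || fd >= FD_SETSIZE)
    return;
  FD_CLR (fd, &sketch_polled_fds);
  /* Shrink the upper bound past any descriptors no longer in the set. */
  while (sketch_highest_fd >= 0
         && !FD_ISSET (sketch_highest_fd, &sketch_polled_fds))
    sketch_highest_fd--;
}

/* Return the lowest polled fd with input available, or -1 if none.
   Uses a zero timeout, so it never blocks. */
static int
sketch_poll_process_input (void)
{
  fd_set ready = sketch_polled_fds;   /* select() overwrites its argument */
  struct timeval no_wait = { 0, 0 };
  int fd;

  if (sketch_highest_fd < 0)
    return -1;
  if (select (sketch_highest_fd + 1, &ready, NULL, NULL, &no_wait) <= 0)
    return -1;
  for (fd = 0; fd <= sketch_highest_fd; fd++)
    if (FD_ISSET (fd, &ready))
      return fd;
  return -1;
}
```

The console callbacks could presumably be handled the same way, with a second set for the descriptors that carry user input.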
20062 @cindex focus handling 20066 @cindex focus handling
20063 20067
20064 Ben's capsule lecture on focus: 20068 Ben's capsule lecture on focus:
20065 20069
20066 In GNU Emacs @code{select-frame} never changes the window-manager frame 20070 In GNU Emacs @code{select-frame} never changes the window-manager frame
20067 focus. All it does is change the "selected frame". This is similar to 20071 focus. All it does is change the ``selected frame''. This is similar to
20068 what happens when we call @code{select-device} or @code{select-console}. 20072 what happens when we call @code{select-device} or @code{select-console}.
20069 Whenever an event comes in (including a keyboard event), its frame is 20073 Whenever an event comes in (including a keyboard event), its frame is
20070 selected; therefore, evaluating @code{select-frame} in @samp{*scratch*} 20074 selected; therefore, evaluating @code{select-frame} in @samp{*scratch*}
20071 won't cause any effects because the next received event (in the same 20075 won't cause any effects because the next received event (in the same
20072 frame) will cause a switch back to the frame displaying 20076 frame) will cause a switch back to the frame displaying
20097 minibuffer, you essentially want to temporarily switch the WM focus to 20101 minibuffer, you essentially want to temporarily switch the WM focus to
20098 the frame with the minibuffer, and switch it back when you exit the 20102 the frame with the minibuffer, and switch it back when you exit the
20099 minibuffer. 20103 minibuffer.
20100 20104
20101 GNU Emacs solves this with the crockish @code{redirect-frame-focus}, 20105 GNU Emacs solves this with the crockish @code{redirect-frame-focus},
20102 which says "for keyboard events received from FRAME, act like they're 20106 which says ``for keyboard events received from FRAME, act like they're
20103 coming from FOCUS-FRAME". I think what this means is that, when a 20107 coming from FOCUS-FRAME''. I think what this means is that, when a
20104 keyboard event comes in and the event manager is about to select the 20108 keyboard event comes in and the event manager is about to select the
20105 event's frame, if that frame has its focus redirected, the redirected-to 20109 event's frame, if that frame has its focus redirected, the redirected-to
20106 frame is selected instead. That way, if you're in a minibufferless 20110 frame is selected instead. That way, if you're in a minibufferless
20107 frame and enter the minibuffer, then all Lisp functions that run see the 20111 frame and enter the minibuffer, then all Lisp functions that run see the
20108 selected frame as the minibuffer's frame rather than the minibufferless 20112 selected frame as the minibuffer's frame rather than the minibufferless
20112 There's also some weird logic that switches the redirected frame focus 20116 There's also some weird logic that switches the redirected frame focus
20113 from one frame to another if Lisp code explicitly calls 20117 from one frame to another if Lisp code explicitly calls
20114 @code{select-frame} (but not if @code{handle-switch-frame} is called), 20118 @code{select-frame} (but not if @code{handle-switch-frame} is called),
20115 and saves and restores the frame focus in window configurations, 20119 and saves and restores the frame focus in window configurations,
20116 etc. etc. All of this logic is heavily @code{#if 0}'d, with lots of 20120 etc. etc. All of this logic is heavily @code{#if 0}'d, with lots of
20117 comments saying "No, this approach doesn't seem to work, so I'm trying 20121 comments saying ``No, this approach doesn't seem to work, so I'm trying
20118 this ... is it reasonable? Well, I'm not sure ..." that are a red flag 20122 this ... is it reasonable? Well, I'm not sure ...'' that are a red flag
20119 indicating crockishness. 20123 indicating crockishness.
20120 20124
20121 Because of our way of doing things, we can avoid all this crock. 20125 Because of our way of doing things, we can avoid all this crock.
20122 Keyboard events never cause a select-frame (who cares what frame they're 20126 Keyboard events never cause a select-frame (who cares what frame they're
20123 associated with? They come from a console, only). We change the actual 20127 associated with? They come from a console, only). We change the actual
24896 return value should be an alist consisting of a list of all of the 24900 return value should be an alist consisting of a list of all of the
24897 defined subtypes for that coding system type along with a level of 24901 defined subtypes for that coding system type along with a level of
24898 likelihood and a list of additional properties indicating certain 24902 likelihood and a list of additional properties indicating certain
24899 features detected in the data. The extra properties returned are 24903 features detected in the data. The extra properties returned are
24900 defined entirely by the particular coding system type and are used 24904 defined entirely by the particular coding system type and are used
24901 only in the algorithm described below under "user control." However, 24905 only in the algorithm described below under ``user control.'' However,
24902 the levels of likelihood have a standard meaning as follows: 24906 the levels of likelihood have a standard meaning as follows:
24903 24907
24904 Level 4 means "near certainty" and typically indicates that a 24908 Level 4 means ``near certainty'' and typically indicates that a
24905 signature has been detected, usually at the beginning of the data, 24909 signature has been detected, usually at the beginning of the data,
24906 indicating that the data is encoded in this particular coding system 24910 indicating that the data is encoded in this particular coding system
24907 type. An example of this would be the byte order mark at the beginning 24911 type. An example of this would be the byte order mark at the beginning
24908 of UCS2 encoded data or the GZIP mark at the beginning of GZIP data. 24912 of UCS2 encoded data or the GZIP mark at the beginning of GZIP data.
24909 24913
24910 Level 3 means "highly likely" and indicates that tell-tale signs have 24914 Level 3 means ``highly likely'' and indicates that tell-tale signs have
24911 been discovered in the data that are characteristic of this particular 24915 been discovered in the data that are characteristic of this particular
24912 coding system type. Examples of this might be ISO 2022 escape 24916 coding system type. Examples of this might be ISO 2022 escape
24913 sequences or the current Unicode end of line markers at regular 24917 sequences or the current Unicode end of line markers at regular
24914 intervals. 24918 intervals.
24915 24919
24916 Level 2 means "strongly statistically likely" indicating that 24920 Level 2 means ``strongly statistically likely'' indicating that
24917 statistical analysis concludes that there's a high chance that this 24921 statistical analysis concludes that there's a high chance that this
24918 data is encoded according to this particular type. For example, this 24922 data is encoded according to this particular type. For example, this
24919 might mean that for UCS2 data, there is a high proportion of null bytes 24923 might mean that for UCS2 data, there is a high proportion of null bytes
24920 or other repeated bytes in the odd-numbered bytes of the data and a 24924 or other repeated bytes in the odd-numbered bytes of the data and a
24921 high variance in the even-numbered bytes of the data. For Shift-JIS, 24925 high variance in the even-numbered bytes of the data. For Shift-JIS,
24922 this might indicate that there were no illegal Shift-JIS sequences 24926 this might indicate that there were no illegal Shift-JIS sequences
24923 and a fairly high occurrence of common Shift-JIS characters. 24927 and a fairly high occurrence of common Shift-JIS characters.
24924 24928
24925 Level 1 means "weak statistical likelihood" meaning that there is some 24929 Level 1 means ``weak statistical likelihood'' meaning that there is some
24926 indication that the data is encoded in this coding system type. In 24930 indication that the data is encoded in this coding system type. In
24927 fact, there is a reasonable chance that it may be some other type as 24931 fact, there is a reasonable chance that it may be some other type as
24928 well. This means, for example, that no illegal sequences were 24932 well. This means, for example, that no illegal sequences were
24929 encountered and at least some data was encountered that is purposely 24933 encountered and at least some data was encountered that is purposely
24930 not in other coding system types. For Shift-JIS data, this might mean 24934 not in other coding system types. For Shift-JIS data, this might mean
24931 that some bytes in the range 128 to 159 were encountered in the data. 24935 that some bytes in the range 128 to 159 were encountered in the data.
24932 24936
24933 Level 0 means "neutral" which is to say that there's either not enough 24937 Level 0 means ``neutral'' which is to say that there's either not enough
24934 data to make any decision or that the data could well be interpreted 24938 data to make any decision or that the data could well be interpreted
24935 as this type (meaning no illegal sequences), but there is little or no 24939 as this type (meaning no illegal sequences), but there is little or no
24936 indication of anything particular to this particular type. 24940 indication of anything particular to this particular type.
24937 24941
24938 Level -1 means "weakly unlikely" meaning that some data was 24942 Level -1 means ``weakly unlikely'' meaning that some data was
24939 encountered that could conceivably be part of the coding system type 24943 encountered that could conceivably be part of the coding system type
24940 but is probably not. For example, successively long line-lengths or 24944 but is probably not. For example, successively long line-lengths or
24941 very rarely-encountered sequences. 24945 very rarely-encountered sequences.
24942 24946
24943 Level -2 means "strongly unlikely" meaning that typically a number 24947 Level -2 means ``strongly unlikely'' meaning that typically a number
24944 of illegal sequences were encountered. 24948 of illegal sequences were encountered.
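
The seven levels above could be written down as a C enumeration along these lines. This is only a sketch: the @code{DET_} names and the helper function are invented for illustration and are not taken from the XEmacs sources. Only the numeric values and their meanings come from the description above.

```c
/* Hypothetical names; the numeric values follow the levels in the text. */
enum detection_level
{
  DET_STRONGLY_UNLIKELY  = -2,  /* typically several illegal sequences seen */
  DET_WEAKLY_UNLIKELY    = -1,  /* conceivable, but probably not this type */
  DET_NEUTRAL            =  0,  /* not enough data, or nothing distinctive */
  DET_WEAK_STATISTICAL   =  1,  /* some indication; other types still likely */
  DET_STRONG_STATISTICAL =  2,  /* statistics point strongly at this type */
  DET_HIGHLY_LIKELY      =  3,  /* tell-tale signs such as escape sequences */
  DET_NEAR_CERTAINTY     =  4   /* signature found, e.g. a byte order mark */
};

/* Return the phrase used in the text for LEVEL. */
static const char *
detection_level_name (enum detection_level level)
{
  switch (level)
    {
    case DET_STRONGLY_UNLIKELY:  return "strongly unlikely";
    case DET_WEAKLY_UNLIKELY:    return "weakly unlikely";
    case DET_NEUTRAL:            return "neutral";
    case DET_WEAK_STATISTICAL:   return "weak statistical likelihood";
    case DET_STRONG_STATISTICAL: return "strongly statistically likely";
    case DET_HIGHLY_LIKELY:      return "highly likely";
    case DET_NEAR_CERTAINTY:     return "near certainty";
    }
  return "unknown";
}
```

Because the values are ordered, a comparison such as @code{level >= DET_HIGHLY_LIKELY} expresses ``at least highly likely'' directly.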
24945 24949
24946 The algorithm to determine when to stop and indicate that the data has 24950 The algorithm to determine when to stop and indicate that the data has
24947 been detected as a particular coding system uses a priority list, 24951 been detected as a particular coding system uses a priority list,
24948 which is typically specified as part of the language environment 24952 which is typically specified as part of the language environment
24957 Japanese-language environment particular subtypes of ISO 2022 will be 24961 Japanese-language environment particular subtypes of ISO 2022 will be
24958 associated with the Japanese coding system version of those 24962 associated with the Japanese coding system version of those
24959 subtypes). It is perfectly legal and quite common in fact, to list the 24963 subtypes). It is perfectly legal and quite common in fact, to list the
24960 same subtype more than once in the priority list with successively 24964 same subtype more than once in the priority list with successively
24961 lower requirements. Other facts that can be listed in the priority 24965 lower requirements. Other facts that can be listed in the priority
24962 list for a subtype are "reject", meaning that the data should never be 24966 list for a subtype are ``reject'', meaning that the data should never be
24963 detected as this subtype, or "ask", meaning that if the data is 24967 detected as this subtype, or ``ask'', meaning that if the data is
24964 detected to be this subtype, the user will be asked whether they 24968 detected to be this subtype, the user will be asked whether they
24965 actually mean this. This latter property could be used, for example, 24969 actually mean this. This latter property could be used, for example,
24966 towards the bottom of the priority list. 24970 towards the bottom of the priority list.
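
One way to picture the walk over the priority list is the sketch below. It rests on several assumptions that the text does not confirm: each entry is taken to hold a subtype index, a minimum level, and an action; ``reject'' is read as excluding the subtype wherever it appears in the list; and the first entry whose requirement is met wins. All structure and function names are hypothetical, and the additional properties and the global minimum are left out.

```c
#include <stddef.h>

enum priority_action { PRIORITY_ACCEPT, PRIORITY_REJECT, PRIORITY_ASK };

/* One entry of the (hypothetical) priority list. */
struct priority_entry
{
  int subtype;                  /* index into the detector's level array */
  int min_level;                /* minimum likelihood needed to match */
  enum priority_action action;
};

struct detection_verdict
{
  int subtype;                  /* matched subtype, or -1 if none */
  int ask_user;                 /* nonzero if the user must confirm/choose */
};

/* Nonzero if any entry marks SUBTYPE as "reject". */
static int
subtype_is_rejected (const struct priority_entry *list, size_t n_entries,
                     int subtype)
{
  size_t i;
  for (i = 0; i < n_entries; i++)
    if (list[i].subtype == subtype && list[i].action == PRIORITY_REJECT)
      return 1;
  return 0;
}

/* Walk LIST in order; LEVELS[s] is the level reported for subtype s. */
static struct detection_verdict
walk_priority_list (const struct priority_entry *list, size_t n_entries,
                    const int *levels, size_t n_subtypes)
{
  struct detection_verdict verdict = { -1, 1 };  /* no match: ask the user */
  size_t i;

  for (i = 0; i < n_entries; i++)
    {
      const struct priority_entry *e = &list[i];

      if (e->subtype < 0 || (size_t) e->subtype >= n_subtypes)
        continue;               /* entry refers to an unknown subtype */
      if (e->action == PRIORITY_REJECT
          || subtype_is_rejected (list, n_entries, e->subtype))
        continue;               /* never detect as this subtype */
      if (levels[e->subtype] < e->min_level)
        continue;               /* requirement not met; try the next entry */

      verdict.subtype = e->subtype;
      verdict.ask_user = (e->action == PRIORITY_ASK);
      return verdict;
    }
  return verdict;
}
```

Listing the same subtype again further down with a lower @code{min_level} gives it a second chance once the stricter entries above it have failed, which is the ``successively lower requirements'' pattern described above.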
24967 24971
24968 In addition there is a global variable which specifies the minimum 24972 In addition there is a global variable which specifies the minimum
24975 system, the subtype, the coding system and the associated level of 24979 system, the subtype, the coding system and the associated level of
24976 likelihood will be prominently displayed either in the echo area or in 24980 likelihood will be prominently displayed either in the echo area or in
24977 a status box somewhere. 24981 a status box somewhere.
24978 24982
24979 If no positive match is found according to the priority list, or if 24983 If no positive match is found according to the priority list, or if
24980 the matches that are found have the "ask" property on them, then the 24984 the matches that are found have the ``ask'' property on them, then the
24981 user will be presented with a list of choices of possible encodings 24985 user will be presented with a list of choices of possible encodings
24982 and asked to choose one. This list is typically sorted first by level 24986 and asked to choose one. This list is typically sorted first by level
24983 of likelihood, and then within this, by the order in which the 24987 of likelihood, and then within this, by the order in which the
24984 subtypes appear in the priority list. This list is displayed in a 24988 subtypes appear in the priority list. This list is displayed in a
24985 special kind of dialog box or other buffer allowing the user, in 24989 special kind of dialog box or other buffer allowing the user, in
24992 will be in the form of errors or warnings of various levels, some of 24996 will be in the form of errors or warnings of various levels, some of
24993 which may be severe enough to stop the decoding entirely, and some of 24997 which may be severe enough to stop the decoding entirely, and some of
24994 which may either indicate definitely malformed data but from which 24998 which may either indicate definitely malformed data but from which
24995 it's possible to recover, or simply data that appears rather 24999 it's possible to recover, or simply data that appears rather
24996 questionable. If any of these status values are reported during 25000 questionable. If any of these status values are reported during
24997 decoding, the user will be informed of this and asked "are you sure?" 25001 decoding, the user will be informed of this and asked ``are you sure?''
24998 As part of the "are you sure" dialog box or question, the user can 25002 As part of the ``are you sure'' dialog box or question, the user can
24999 display the results of the decoding to make sure it's correct. If the 25003 display the results of the decoding to make sure it's correct. If the
25000 user says "no, they're not sure," then the same list of choices as 25004 user says ``no, they're not sure,'' then the same list of choices as
25001 previously mentioned will be presented. 25005 previously mentioned will be presented.
25002 25006
25003 @subheading RFC: Autodetection 25007 @subheading RFC: Autodetection
25004 25008
25005 Also appeared under heading "Implementation of Coding System Priority 25009 Also appeared under heading "Implementation of Coding System Priority
25215 25219
25216 @enumerate 25220 @enumerate
25217 @item 25221 @item
25218 Hopefully a system general enough to handle (2)--(4) will 25222 Hopefully a system general enough to handle (2)--(4) will
25219 handle these, too, but we should watch out for gotchas like 25223 handle these, too, but we should watch out for gotchas like
25220 Unicode "plane 14" tags which (I think _both_ Ben and Olivier 25224 Unicode ``plane 14'' tags which (I think _both_ Ben and Olivier
25221 will agree) have no place in the internal representation, and 25225 will agree) have no place in the internal representation, and
25222 thus must be treated as out-of-band control sequences. I 25226 thus must be treated as out-of-band control sequences. I
25223 don't know if all such gotchas will be as easy to dispose of. 25227 don't know if all such gotchas will be as easy to dispose of.
25224 25228
25225 @item 25229 @item
25256 25260
25257 sly, it can't be perfect if any autodecoding is done; 25261 sly, it can't be perfect if any autodecoding is done;
25258 like Hrvoje should have an easily available option to 25262 like Hrvoje should have an easily available option to
25259 to this default (or an optimized approximation which 25263 to this default (or an optimized approximation which
25260 t actually read the whole file into a buffer) or simply 25264 t actually read the whole file into a buffer) or simply
25261 y everything as binary (with the "font" for binary files 25265 y everything as binary (with the ``font'' for binary files
25262 a user option). 25266 a user option).
25263 25267
25264 @item 25268 @item
25265 This implies that we should be detecting conditions in the 25269 This implies that we should be detecting conditions in the
25266 tail of the file which violate the implicit assumptions of the 25270 tail of the file which violate the implicit assumptions of the
25365 25369
25366 Date: 11/1/1999 7:24 AM 25370 Date: 11/1/1999 7:24 AM
25367 25371
25368 Stephen, thank you very much for writing this up. I think it is a good start, 25372 Stephen, thank you very much for writing this up. I think it is a good start,
25369 and definitely moving in the direction I would like to see things going: more 25373 and definitely moving in the direction I would like to see things going: more
25370 proposals, less arguing. (aka "more light, less heat") However, I have some 25374 proposals, less arguing. (aka ``more light, less heat'') However, I have some
25371 suggestions for cleaning this up: 25375 suggestions for cleaning this up:
25372 25376
25373 You should try to make it more layered. For example, you might have one 25377 You should try to make it more layered. For example, you might have one
25374 section devoted to the workings of autodetection, which starts out like this 25378 section devoted to the workings of autodetection, which starts out like this
25375 (the section numbers below are totally arbitrary): 25379 (the section numbers below are totally arbitrary):