comparison man/lispref/mule.texi @ 2640:a4040d921acc
[xemacs-hg @ 2005-03-09 05:36:28 by stephent]
internals and lispref <871xapfkkq.fsf@tleepslib.sk.tsukuba.ac.jp>
author | stephent |
---|---|
date | Wed, 09 Mar 2005 05:36:50 +0000 |
parents | ecf1ebac70d8 |
children | d5bfa26d5c3f |
2639:cd00e5eeb22a | 2640:a4040d921acc |
---|---|
1763 @menu | 1763 @menu |
1764 * CCL Syntax:: CCL program syntax in BNF notation. | 1764 * CCL Syntax:: CCL program syntax in BNF notation. |
1765 * CCL Statements:: Semantics of CCL statements. | 1765 * CCL Statements:: Semantics of CCL statements. |
1766 * CCL Expressions:: Operators and expressions in CCL. | 1766 * CCL Expressions:: Operators and expressions in CCL. |
1767 * Calling CCL:: Running CCL programs. | 1767 * Calling CCL:: Running CCL programs. |
1768 * CCL Examples:: The encoding functions for Big5 and KOI-8. | 1768 * CCL Example:: A trivial program to transform the Web's URL encoding. |
1769 @end menu | 1769 @end menu |
1770 | 1770 |
1771 @node CCL Syntax, CCL Statements, , CCL | 1771 @node CCL Syntax, CCL Statements, , CCL |
1772 @comment Node, Next, Previous, Up | 1772 @comment Node, Next, Previous, Up |
1773 @subsection CCL Syntax | 1773 @subsection CCL Syntax |
1984 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a | 1984 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a |
1985 complicated transformation of the Japanese standard JIS encoding to | 1985 complicated transformation of the Japanese standard JIS encoding to |
1986 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to | 1986 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to |
1987 represent the SJIS operations in infix form. | 1987 represent the SJIS operations in infix form. |
1988 | 1988 |
1989 @node Calling CCL, CCL Examples, CCL Expressions, CCL | 1989 @node Calling CCL, CCL Example, CCL Expressions, CCL |
1990 @comment Node, Next, Previous, Up | 1990 @comment Node, Next, Previous, Up |
1991 @subsection Calling CCL | 1991 @subsection Calling CCL |
1992 | 1992 |
1993 CCL programs are called automatically during Emacs buffer I/O when the | 1993 CCL programs are called automatically during Emacs buffer I/O when the |
1994 external representation has a coding system type of @code{shift-jis}, | 1994 external representation has a coding system type of @code{shift-jis}, |
2050 | 2050 |
2051 @defun ccl-reset-elapsed-time | 2051 @defun ccl-reset-elapsed-time |
2052 Resets the CCL interpreter's internal elapsed time registers. | 2052 Resets the CCL interpreter's internal elapsed time registers. |
2053 @end defun | 2053 @end defun |
2054 | 2054 |
2055 @node CCL Examples, , Calling CCL, CCL | 2055 @node CCL Example, , Calling CCL, CCL |
2056 @comment Node, Next, Previous, Up | 2056 @comment Node, Next, Previous, Up |
2057 @subsection CCL Examples | 2057 @subsection CCL Example |
2058 | 2058 |
2059 This section is not yet written. | 2059 In this section, we describe the implementation of a trivial coding |
2060 system to transform from the Web's URL encoding to XEmacs' internal | |
2061 coding. Many people will have been first exposed to URL encoding when | |
2062 they saw ``%20'' where they expected a space in a file's name on their | |
2063 local hard disk; this can happen when a browser saves a file from the | |
2064 web and doesn't encode the name, as passed from the server, properly. | |
2065 | |
2066 URL encoding itself is underspecified with regard to encodings beyond | |
2067 ASCII. The relevant document, RFC 1738, explicitly doesn't give any | |
2068 information on how to encode non-ASCII characters, and the ``obvious'' | |
2069 way---use the %xx values for the octets of the eight bit MIME character | |
2070 set in which the page was served---breaks when a user types a character | |
2071 outside that character set. Best practice for web development is to | |
2072 serve all pages as UTF-8 and treat incoming form data as using that | |
2073 coding system. (Oh, and gamble that your clients won't ever want to | |
2074 type anything outside Unicode. But that's not so much of a gamble with | |
2075 today's client operating systems.) We don't treat non-ASCII in this | |
2076 example, as dealing with @samp{(read-multibyte-character ...)} and | |
2077 errors therewith would make it much harder to understand. | |
2078 | |
2079 Since CCL isn't a very rich language, we move much of the logic that | |
2080 would ordinarily be computed from operations like @code{(member ..)}, | |
2081 @code{(and ...)} and @code{(or ...)} into tables, from which register | |
2082 values are read and written, and on which @code{if} statements are | |
2083 predicated. Much more of the implementation of this coding system is | |
2084 occupied with constructing these tables---in normal Emacs Lisp---than it | |
2085 is with actual CCL code. | |
2086 | |
2087 All the @code{defvar} statements we deal with in the next few sections | |
2088 are surrounded by a @code{(eval-and-compile ...)}, which means that the | |
2089 logic which initializes these variables executes at compile time, and if | |
2090 XEmacs loads the compiled version of the file, these variables are | |
2091 initialized as constants. | |
2092 | |
2093 @menu | |
2094 * Four bits to ASCII:: Two tables used for getting hex digits from ASCII. | |
2095 * URI Encoding constants:: Useful predefined characters. | |
2096 * Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL. | |
2097 * Characters to be preserved:: No transformation needed for these characters. | |
2098 * The program to decode to internal format:: CCL to decode URL-encoded text. | |
2099 * The program to encode from internal format:: CCL to URL-encode ASCII text. | |
2100 | |
2101 @end menu | |
2102 | |
2103 @node Four bits to ASCII, URI Encoding constants, , CCL Example | |
2104 @subsubsection Four bits to ASCII | |
2105 | |
2106 The first @code{defvar} is for | |
2107 @code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that | |
2108 maps from an octet's value to the ASCII encoding for the hex value of | |
2109 its most significant four bits. That might sound complex, but it isn't; | |
2110 for decimal 65, hex value @samp{#x41}, the entry in the table is the | |
2111 ASCII encoding of `4'. For decimal 122, ASCII `z', hex value | |
2112 @code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)} | |
2113 after this file is loaded gives the ASCII encoding of 7. | |
2114 | |
2115 @example | |
2116 (defvar url-coding-high-order-nybble-as-ascii | |
2117 (let ((val (make-vector 256 0)) | |
2118 (i 0)) | |
2119 (while (< i (length val)) | |
2120 (aset val i (char-int (aref (format "%02X" i) 0))) | |
2121 (setq i (1+ i))) | |
2122 val) | |
2123 "Table to find an ASCII version of an octet's most significant 4 bits.") | |
2124 @end example | |
2125 | |
2126 The next table, @code{url-coding-low-order-nybble-as-ascii}, is almost | |
2127 the same thing, but this time it has a map for the hex encoding of the | |
2128 low-order four bits. So the entry at offset @samp{#x41} (decimal 65) is | |
2129 the ASCII encoding of `1', and the entry at offset @samp{#x7a} (decimal | |
2130 122) is the ASCII encoding of `A'. | |
2131 | |
2132 @example | |
2133 (defvar url-coding-low-order-nybble-as-ascii | |
2134 (let ((val (make-vector 256 0)) | |
2135 (i 0)) | |
2136 (while (< i (length val)) | |
2137 (aset val i (char-int (aref (format "%02X" i) 1))) | |
2138 (setq i (1+ i))) | |
2139 val) | |
2140 "Table to find an ASCII version of an octet's least significant 4 bits.") | |
2141 @end example | |
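
As a quick check on what the two tables hold, the following sketch
computes the same pair of digits directly with @code{format}. The
helper name @code{my-url-coding-nybble-digits} is hypothetical and is
not part of the file under discussion; it returns one-character strings
rather than character codes, so that it does not depend on how a given
Emacs represents characters.

@example
;; Hypothetical helper for illustration; not part of the original file.
(defun my-url-coding-nybble-digits (octet)
  "Return a list (HIGH LOW) of one-character hex digit strings for OCTET."
  (let ((hex (format "%02X" octet)))
    (list (substring hex 0 1) (substring hex 1 2))))

(my-url-coding-nybble-digits #x41)      ; => ("4" "1")
(my-url-coding-nybble-digits #x7a)      ; => ("7" "A")
@end example

The tables store the character codes of these digits, indexed by octet,
because a table lookup is something CCL can do and a call to
@code{format} is not.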
2142 | |
2143 @node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example | |
2144 @subsubsection URI Encoding constants | |
2145 | |
2146 Next, we have a couple of variables that make the CCL code more | |
2147 readable. The first is the ASCII encoding of the percentage sign; this | |
2148 character is used as an escape code, to start the encoding of a | |
2149 non-printable character. For historical reasons, URL encoding allows | |
2150 the space character to be encoded as a plus sign--it does make typing | |
2151 URLs like @samp{http://google.com/search?q=XEmacs+home+page} easier--and | |
2152 as such, we have to check when decoding for this value, and map it to | |
2153 the space character. When doing this in CCL, we use the | |
2154 @code{url-coding-escaped-space-code} variable. | |
2155 | |
2156 @example | |
2157 (defvar url-coding-escape-character-code (char-int ?%) | |
2158 "The code point for the percentage sign, in ASCII.") | |
2159 | |
2160 (defvar url-coding-escaped-space-code (char-int ?+) | |
2161 "The URL-encoded value of the space character, that is, +.") | |
2162 @end example | |
2163 | |
2164 @node Numeric to ASCII-hexadecimal conversion | |
2165 @subsubsection Numeric to ASCII-hexadecimal conversion | |
2166 | |
2167 Now, we have a couple of utility tables that wouldn't be necessary in | |
2168 a more expressive programming language than is CCL. The first is sixteen | |
2169 in length, and maps a hexadecimal number to the ASCII encoding of that | |
2170 number; so zero maps to ASCII `0', ten maps to ASCII `A.' The second | |
2171 does the reverse; that is, it maps an ASCII character to its value when | |
2172 interpreted as a hexadecimal digit. ('A' => 10, 'c' => 12, '2' => 2, as | |
2173 a few examples.) | |
2174 | |
2175 @example | |
2176 (defvar url-coding-hex-digit-table | |
2177 (let ((i 0) | |
2178 (val (make-vector 16 0))) | |
2179 (while (< i 16) | |
2180 (aset val i (char-int (aref (format "%X" i) 0))) | |
2181 (setq i (1+ i))) | |
2182 val) | |
2183 "A map from a hexadecimal digit's numeric value to its encoding in ASCII.") | |
2184 | |
2185 (defvar url-coding-latin-1-as-hex-table | |
2186 (let ((val (make-vector 256 0)) | |
2187 (i 0)) | |
2188 (while (< i (length val)) | |
2189 ;; Get a hex val for this ASCII character. | |
2190 (aset val i (string-to-int (format "%c" i) 16)) | |
2191 (setq i (1+ i))) | |
2192 val) | |
2193 "A map from Latin 1 code points to their values as hexadecimal digits.") | |
2194 @end example | |
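
The two mappings that these tables precompute can be tried out directly
in Lisp. This is only a sketch of the mapping, assuming a
@code{string-to-number} that accepts a base argument; the CCL programs
themselves use the tables, not these calls.

@example
;; Digit value to ASCII hex digit, as in url-coding-hex-digit-table.
(format "%X" 10)                ; => "A"

;; ASCII hex digit to digit value, as in url-coding-latin-1-as-hex-table.
(string-to-number "A" 16)       ; => 10
(string-to-number "c" 16)       ; => 12
(string-to-number "2" 16)       ; => 2
@end example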
2195 | |
2196 @node Characters to be preserved | |
2197 @subsubsection Characters to be preserved | |
2198 | |
2199 And finally, the last of these tables. URL encoding says that | |
2200 alphanumeric characters, the underscore, hyphen and the full stop | |
2201 @footnote{That's what the standards call it, though my North American | |
2202 readers will be more familiar with it as the period character.} retain | |
2203 their ASCII encoding, and don't undergo transformation. | |
2204 @code{url-coding-should-preserve-table} is an array in which the entries | |
2205 are one if the corresponding ASCII character should be left as-is, and | |
2206 zero if they should be transformed. So the entries for all the control | |
2207 and most of the punctuation characters are zero. Lisp programmers will | |
2208 observe that this initialization is particularly inefficient, but | |
2209 they'll also be aware that this is a long way from an inner loop where | |
2210 every nanosecond counts. | |
2211 | |
2212 @example | |
2213 (defvar url-coding-should-preserve-table | |
2214 (let ((preserve | |
2215 (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o | |
2216 ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G | |
2217 ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y | |
2218 ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9)) | |
2219 (i 0) | |
2220 (res (make-vector 256 0))) | |
2221 (while (< i 256) | |
2222 (when (member (int-char i) preserve) | |
2223 (aset res i 1)) | |
2224 (setq i (1+ i))) | |
2225 res) | |
2226 "A 256-entry array of flags, indicating whether or not to preserve an | |
2227 octet as its ASCII encoding.") | |
2228 @end example | |
2229 | |
2230 @node The program to decode to internal format | |
2231 @subsubsection The program to decode to internal format | |
2232 | |
2233 After the almost interminable tables, we get to the CCL. The first | |
2234 CCL program, @code{ccl-decode-urlcoding} decodes from the URL coding to | |
2235 our internal format; since this version of CCL doesn't have support for | |
2236 error checking on the input, we don't do any verification on it. | |
2237 | |
2238 The buffer magnification--approximate ratio of the size of the output | |
2239 buffer to the size of the input buffer--is declared as one, because | |
2240 fractional values aren't allowed. (Since all those %20's will map to | |
2241 ` ', the length of the output text will be less than that of the input | |
2242 text.) | |
2243 | |
2244 So, first we read an octet from the input buffer into register | |
2245 @samp{r0}, to set up the loop. Next, we start the loop, with a | |
2246 @code{(loop ...)} statement, and we check if the value in @samp{r0} is a | |
2247 percentage sign. (Note the comma before | |
2248 @code{url-coding-escape-character-code}; since CCL is a Lisp macro | |
2249 language, we can break out of the macro evaluation with a comma, and as | |
2250 such, ``@code{,url-coding-escape-character-code}'' will be evaluated as a | |
2251 literal `37.') | |
2252 | |
2253 If it is a percentage sign, we read the next two octets into @samp{r2} | |
2254 and @samp{r3}, and convert them into their hexadecimal numeric values, | |
2255 using the @code{url-coding-latin-1-as-hex-table} array declared above. | |
2256 (But again, it'll be interpreted as a literal array.) We then left | |
2257 shift the first by four bits, combine the two with a bitwise OR, and | |
2258 write the result to the output buffer. | |
2259 | |
2260 If it isn't a percentage sign, and it is a `+' sign, we write a | |
2261 space--hexadecimal 20--to the output buffer. | |
2262 | |
2263 If none of those things are true, we pass the octet to the output buffer | |
2264 untransformed. (This could be a place to put error checking, in a more | |
2265 expressive language.) We then read one more octet from the input | |
2266 buffer, and move to the next iteration of the loop. | |
2267 | |
2268 @example | |
2269 (define-ccl-program ccl-decode-urlcoding | |
2270 `(1 | |
2271 ((read r0) | |
2272 (loop | |
2273 (if (r0 == ,url-coding-escape-character-code) | |
2274 ((read r2 r3) | |
2275 ;; Replace r2 with the value at offset r2 in | |
2276 ;; url-coding-latin-1-as-hex-table, then do the same for r3. | |
2277 (r2 = r2 ,url-coding-latin-1-as-hex-table) | |
2278 (r3 = r3 ,url-coding-latin-1-as-hex-table) | |
2279 (r2 <<= 4) | |
2280 (r3 |= r2) | |
2281 (write r3)) | |
2282 (if (r0 == ,url-coding-escaped-space-code) | |
2283 (write #x20) | |
2284 (write r0))) | |
2285 (read r0) | |
2286 (repeat)))) | |
2287 "CCL program to take URI-encoded ASCII text and transform it to our | |
2288 internal encoding. ") | |
2289 @end example | |
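
For comparison, here is a sketch of the same decoding steps in ordinary
Lisp, operating on a string instead of a CCL input buffer. The function
name @code{my-url-decode-string} is hypothetical and is not part of the
file under discussion. It is meant as a reference model for checking
expected results, and it makes one choice of its own: a percent sign
with fewer than two characters after it is passed through unchanged,
which is not necessarily what the CCL program does at the end of its
input.

@example
;; Hypothetical reference model in plain Lisp; not part of the original file.
(defun my-url-decode-string (string)
  "Return STRING with %XX escapes and plus signs decoded, ASCII only."
  (let ((i 0)
        (len (length string))
        (parts nil))
    (while (< i len)
      (let ((c (substring string i (1+ i))))
        (cond
         ;; An escape: two hex digits follow the percent sign.
         ((and (string= c "%") (<= (+ i 3) len))
          (let ((high (string-to-number
                       (substring string (+ i 1) (+ i 2)) 16))
                (low (string-to-number
                      (substring string (+ i 2) (+ i 3)) 16)))
            ;; Shift the first digit left four bits and OR in the second.
            (setq parts (cons (format "%c" (logior (ash high 4) low)) parts)
                  i (+ i 3))))
         ;; A plus sign stands for a space.
         ((string= c "+")
          (setq parts (cons " " parts)
                i (1+ i)))
         ;; Anything else passes through untransformed.
         (t
          (setq parts (cons c parts)
                i (1+ i))))))
    (apply 'concat (nreverse parts))))

(my-url-decode-string "XEmacs+home%20page")   ; => "XEmacs home page"
@end example

Like the CCL program, this model does no validation: characters after a
percent sign that are not hex digits are simply treated as zero.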
2290 | |
2291 @node The program to encode from internal format | |
2292 @subsubsection The program to encode from internal format | |
2293 | |
2294 Next, we see the CCL program to encode ASCII text as URL coded text. | |
2295 Here, the buffer magnification is specified as three, to account for ` ' | |
2296 mapping to %20, etc. As before, we read an octet from the input into | |
2297 @samp{r0}, and move into the body of the loop. Next, we check if we | |
2298 should preserve the value of this octet, by reading from offset | |
2299 @samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}. | |
2300 Then we have an @samp{if} statement predicated on the value in | |
2301 @samp{r1}; for the true branch, we write the input octet directly. For | |
2302 the false branch, we write a percentage sign, the ASCII encoding of the | |
2303 high four bits in hex, and then the ASCII encoding of the low four bits | |
2304 in hex. | |
2305 | |
2306 We then read an octet from the input into @samp{r0}, and repeat the loop. | |
2307 | |
2308 @example | |
2309 (define-ccl-program ccl-encode-urlcoding | |
2310 `(3 | |
2311 ((read r0) | |
2312 (loop | |
2313 (r1 = r0 ,url-coding-should-preserve-table) | |
2314 ;; If we should preserve the value, just write the octet directly. | |
2315 (if r1 | |
2316 (write r0) | |
2317 ;; else, write a percentage sign, and the hex value of the octet, in | |
2318 ;; an ASCII-friendly format. | |
2319 ((write ,url-coding-escape-character-code) | |
2320 (write r0 ,url-coding-high-order-nybble-as-ascii) | |
2321 (write r0 ,url-coding-low-order-nybble-as-ascii))) | |
2322 (read r0) | |
2323 (repeat)))) | |
2324 "CCL program to encode octets (almost) according to RFC 1738") | |
2325 @end example | |
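
The encoding direction can be modelled in ordinary Lisp in the same
way. Again, @code{my-url-encode-string} is a hypothetical name and the
function is not part of the file under discussion; it assumes ASCII
input and uses a regular expression where the CCL program uses
@code{url-coding-should-preserve-table}.

@example
;; Hypothetical reference model in plain Lisp; not part of the original file.
(defun my-url-encode-string (string)
  "Return ASCII STRING with all but alphanumerics, -, _ and . escaped."
  (let ((i 0)
        (len (length string))
        (parts nil))
    (while (< i len)
      (let* ((c (substring string i (1+ i)))
             (ch (aref string i))
             ;; XEmacs has a separate character type and needs char-int;
             ;; where char-int is absent, characters are already integers.
             (code (if (fboundp 'char-int) (char-int ch) ch)))
        (setq parts
              (cons (if (string-match "\\`[-_.A-Za-z0-9]\\'" c)
                        ;; Preserved characters are written as they are.
                        c
                      ;; Everything else becomes a percent sign and two
                      ;; upper-case hex digits.
                      (format "%%%02X" code))
                    parts)
              i (1+ i))))
    (apply 'concat (nreverse parts))))

(my-url-encode-string "XEmacs home page")     ; => "XEmacs%20home%20page"
@end example

Note that, like the CCL program, this model writes a space as
@samp{%20} rather than as a plus sign.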
2060 | 2326 |
2061 @node Category Tables, Unicode Support, CCL, MULE | 2327 @node Category Tables, Unicode Support, CCL, MULE |
2062 @section Category Tables | 2328 @section Category Tables |
2063 | 2329 |
2064 A category table is a type of char table used for keeping track of | 2330 A category table is a type of char table used for keeping track of |