comparison man/lispref/mule.texi @ 2640:a4040d921acc

[xemacs-hg @ 2005-03-09 05:36:28 by stephent] internals and lispref <871xapfkkq.fsf@tleepslib.sk.tsukuba.ac.jp>
author stephent
date Wed, 09 Mar 2005 05:36:50 +0000
parents ecf1ebac70d8
children d5bfa26d5c3f
2639:cd00e5eeb22a 2640:a4040d921acc
1763 @menu 1763 @menu
1764 * CCL Syntax:: CCL program syntax in BNF notation. 1764 * CCL Syntax:: CCL program syntax in BNF notation.
1765 * CCL Statements:: Semantics of CCL statements. 1765 * CCL Statements:: Semantics of CCL statements.
1766 * CCL Expressions:: Operators and expressions in CCL. 1766 * CCL Expressions:: Operators and expressions in CCL.
1767 * Calling CCL:: Running CCL programs. 1767 * Calling CCL:: Running CCL programs.
1768 * CCL Examples:: The encoding functions for Big5 and KOI-8. 1768 * CCL Example:: A trivial program to transform the Web's URL encoding.
1769 @end menu 1769 @end menu
1770 1770
1771 @node CCL Syntax, CCL Statements, , CCL 1771 @node CCL Syntax, CCL Statements, , CCL
1772 @comment Node, Next, Previous, Up 1772 @comment Node, Next, Previous, Up
1773 @subsection CCL Syntax 1773 @subsection CCL Syntax
1984 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a 1984 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
1985 complicated transformation of the Japanese standard JIS encoding to 1985 complicated transformation of the Japanese standard JIS encoding to
1986 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to 1986 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
1987 represent the SJIS operations in infix form. 1987 represent the SJIS operations in infix form.
1988 1988
1989 @node Calling CCL, CCL Examples, CCL Expressions, CCL 1989 @node Calling CCL, CCL Example, CCL Expressions, CCL
1990 @comment Node, Next, Previous, Up 1990 @comment Node, Next, Previous, Up
1991 @subsection Calling CCL 1991 @subsection Calling CCL
1992 1992
1993 CCL programs are called automatically during Emacs buffer I/O when the 1993 CCL programs are called automatically during Emacs buffer I/O when the
1994 external representation has a coding system type of @code{shift-jis}, 1994 external representation has a coding system type of @code{shift-jis},
2050 2050
2051 @defun ccl-reset-elapsed-time 2051 @defun ccl-reset-elapsed-time
2052 Resets the CCL interpreter's internal elapsed time registers. 2052 Resets the CCL interpreter's internal elapsed time registers.
2053 @end defun 2053 @end defun
2054 2054
2055 @node CCL Examples, , Calling CCL, CCL 2055 @node CCL Example, , Calling CCL, CCL
2056 @comment Node, Next, Previous, Up 2056 @comment Node, Next, Previous, Up
2057 @subsection CCL Examples 2057 @subsection CCL Example
2058 2058
2059 This section is not yet written. 2059 In this section, we describe the implementation of a trivial coding
2060 system to transform from the Web's URL encoding to XEmacs' internal
2061 coding. Many people will have been first exposed to URL encoding when
2062 they saw ``%20'' where they expected a space in a file's name on their
2063 local hard disk; this can happen when a browser saves a file from the
2064 web and doesn't decode the name, as passed from the server, properly.
2065
2066 URL encoding itself is underspecified with regard to encodings beyond
2067 ASCII. The relevant document, RFC 1738, explicitly doesn't give any
2068 information on how to encode non-ASCII characters, and the ``obvious''
2069 way---use the %xx values for the octets of the eight bit MIME character
2070 set in which the page was served---breaks when a user types a character
2071 outside that character set. Best practice for web development is to
2072 serve all pages as UTF-8 and treat incoming form data as using that
2073 coding system. (Oh, and gamble that your clients won't ever want to
2074 type anything outside Unicode. But that's not so much of a gamble with
2075 today's client operating systems.) We don't treat non-ASCII in this
2076 example, as dealing with @samp{(read-multibyte-character ...)} and
2077 errors therewith would make it much harder to understand.
2078
2079 Since CCL isn't a very rich language, we move much of the logic that
2080 would ordinarily be computed from operations like @code{(member ..)},
2081 @code{(and ...)} and @code{(or ...)} into tables, from which register
2082 values are read and written, and on which @code{if} statements are
2083 predicated. Much more of the implementation of this coding system is
2084 occupied with constructing these tables---in normal Emacs Lisp---than it
2085 is with actual CCL code.
2086
2087 All the @code{defvar} statements we deal with in the next few sections
2088 are surrounded by a @code{(eval-and-compile ...)}, which means that the
2089 logic which initializes these variables executes at compile time, and if
2090 XEmacs loads the compiled version of the file, these variables are
2091 initialized as constants.
2092
2093 @menu
2094 * Four bits to ASCII:: Two tables used for getting hex digits from ASCII.
2095 * URI Encoding constants:: Useful predefined characters.
2096 * Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL.
2097 * Characters to be preserved:: No transformation needed for these characters.
2098 * The program to decode to internal format:: The CCL decoder.
2099 * The program to encode from internal format:: The CCL encoder.
2100
2101 @end menu
2102
2103 @node Four bits to ASCII, URI Encoding constants, , CCL Example
2104 @subsubsection Four bits to ASCII
2105
2106 The first @code{defvar} is for
2107 @code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that
2108 maps from an octet's value to the ASCII encoding for the hex value of
2109 its most significant four bits. That might sound complex, but it isn't;
2110 for decimal 65, hex value @samp{#x41}, the entry in the table is the
2111 ASCII encoding of `4'. For decimal 122, ASCII `z', hex value
2112 @code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)}
2113 after this file is loaded gives the ASCII encoding of 7.
2114
2115 @example
2116 (defvar url-coding-high-order-nybble-as-ascii
2117   (let ((val (make-vector 256 0))
2118         (i 0))
2119     (while (< i (length val))
2120       (aset val i (char-int (aref (format "%02X" i) 0)))
2121       (setq i (1+ i)))
2122     val)
2123   "Table to find an ASCII version of an octet's most significant 4 bits.")
2124 @end example
2125
2126 The next table, @code{url-coding-low-order-nybble-as-ascii}, is almost
2127 the same thing, but this time it has a map for the hex encoding of the
2128 low-order four bits. So the entry at offset 65 (@samp{#x41}) is the
2129 ASCII encoding of `1', and the entry at offset 122 (@samp{#x7a}) is
2130 the ASCII encoding of `A'.
2131
2132 @example
2133 (defvar url-coding-low-order-nybble-as-ascii
2134   (let ((val (make-vector 256 0))
2135         (i 0))
2136     (while (< i (length val))
2137       (aset val i (char-int (aref (format "%02X" i) 1)))
2138       (setq i (1+ i)))
2139     val)
2140   "Table to find an ASCII version of an octet's least significant 4 bits.")
2141 @end example
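
As a rough illustration of what the two tables hold, the sketch below
computes the same pair of hex digits for a single octet directly with
@code{format}. The name @code{url-coding-sketch-nybbles} is hypothetical
and is not part of the code described in this section; it returns
one-character strings rather than character codes, so that it does not
depend on @code{char-int}.

@example
;; Hypothetical helper, for illustration only.
(defun url-coding-sketch-nybbles (octet)
  "Return a list of the high and low hex digits of OCTET, as strings."
  (let ((hex (format "%02X" octet)))
    (list (substring hex 0 1) (substring hex 1 2))))

(url-coding-sketch-nybbles #x41)   ; => ("4" "1")
(url-coding-sketch-nybbles #x7a)   ; => ("7" "A")
@end example

The tables simply precompute these answers for all 256 octet values, so
that the CCL program only has to do an array lookup.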
2142
2143 @node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example
2144 @subsubsection URI Encoding constants
2145
2146 Next, we have a couple of variables that make the CCL code more
2147 readable. The first is the ASCII encoding of the percentage sign; this
2148 character is used as an escape code, to start the encoding of a
2149 non-printable character. For historical reasons, URL encoding allows
2150 the space character to be encoded as a plus sign--it does make typing
2151 URLs like @samp{http://google.com/search?q=XEmacs+home+page} easier--and
2152 as such, we have to check when decoding for this value, and map it to
2153 the space character. When doing this in CCL, we use the
2154 @code{url-coding-escaped-space-code} variable.
2155
2156 @example
2157 (defvar url-coding-escape-character-code (char-int ?%)
2158   "The code point for the percentage sign, in ASCII.")
2159
2160 (defvar url-coding-escaped-space-code (char-int ?+)
2161   "The URL-encoded value of the space character, that is, +.")
2162 @end example
2163
2164 @node Numeric to ASCII-hexadecimal conversion
2165 @subsubsection Numeric to ASCII-hexadecimal conversion
2166
2167 Now, we have a couple of utility tables that wouldn't be necessary in
2168 a more expressive programming language than CCL. The first is sixteen
2169 in length, and maps a hexadecimal digit's value to the ASCII encoding of
2170 that digit; so zero maps to ASCII `0', and ten maps to ASCII `A'. The second
2171 does the reverse; that is, it maps an ASCII character to its value when
2172 interpreted as a hexadecimal digit. (`A' => 10, `c' => 12, `2' => 2, as
2173 a few examples.)
2174
2175 @example
2176 (defvar url-coding-hex-digit-table
2177   (let ((i 0)
2178         (val (make-vector 16 0)))
2179     (while (< i 16)
2180       (aset val i (char-int (aref (format "%X" i) 0)))
2181       (setq i (1+ i)))
2182     val)
2183   "A map from a hexadecimal digit's numeric value to its encoding in ASCII.")
2184
2185 (defvar url-coding-latin-1-as-hex-table
2186   (let ((val (make-vector 256 0))
2187         (i 0))
2188     (while (< i (length val))
2189       ;; Get a hex val for this ASCII character.
2190       (aset val i (string-to-int (format "%c" i) 16))
2191       (setq i (1+ i)))
2192     val)
2193   "A map from Latin 1 code points to their values as hexadecimal digits.")
2194 @end example
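
To make the direction of the second table concrete, here is a minimal
sketch of the same digit-to-value mapping for a single character. The
function name @code{url-coding-sketch-hex-value} is hypothetical, and
the sketch assumes that @code{string-to-number} accepts a base as its
second argument.

@example
;; Hypothetical helper, for illustration only.
(defun url-coding-sketch-hex-value (digit)
  "Return the numeric value of DIGIT, a one-character string."
  (string-to-number digit 16))

(url-coding-sketch-hex-value "A")   ; => 10
(url-coding-sketch-hex-value "c")   ; => 12
(url-coding-sketch-hex-value "2")   ; => 2
@end example

The table in the example above stores these values at the offset of each
character's code point, which is what lets CCL perform the conversion
with a single array reference.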
2195
2196 @node Characters to be preserved
2197 @subsubsection Characters to be preserved
2198
2199 And finally, the last of these tables. URL encoding says that
2200 alphanumeric characters, the underscore, hyphen and the full stop
2201 @footnote{That's what the standards call it, though my North American
2202 readers will be more familiar with it as the period character.} retain
2203 their ASCII encoding, and don't undergo transformation.
2204 @code{url-coding-should-preserve-table} is an array in which the entries
2205 are one if the corresponding ASCII character should be left as-is, and
2206 zero if they should be transformed. So the entries for all the control
2207 and most of the punctuation characters are zero. Lisp programmers will
2208 observe that this initialization is particularly inefficient, but
2209 they'll also be aware that this is a long way from an inner loop where
2210 every nanosecond counts.
2211
2212 @example
2213 (defvar url-coding-should-preserve-table
2214   (let ((preserve
2215          (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o
2216                ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G
2217                ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y
2218                ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9))
2219         (i 0)
2220         (res (make-vector 256 0)))
2221     (while (< i 256)
2222       (when (member (int-char i) preserve)
2223         (aset res i 1))
2224       (setq i (1+ i)))
2225     res)
2226   "A 256-entry array of flags, indicating whether or not to preserve an
2227 octet as its ASCII encoding.")
2228 @end example
2229
2230 @node The program to decode to internal format
2231 @subsubsection The program to decode to internal format
2232
2233 After the almost interminable tables, we get to the CCL. The first
2234 CCL program, @code{ccl-decode-urlcoding} decodes from the URL coding to
2235 our internal format; since this version of CCL doesn't have support for
2236 error checking on the input, we don't do any verification on it.
2237
2238 The buffer magnification--approximate ratio of the size of the output
2239 buffer to the size of the input buffer--is declared as one, because
2240 fractional values aren't allowed. (Since all those %20's will map to
2241 ` ', the length of the output text will be less than that of the input
2242 text.)
2243
2244 So, first we read an octet from the input buffer into register
2245 @samp{r0}, to set up the loop. Next, we start the loop, with a
2246 @code{(loop ...)} statement, and we check if the value in @samp{r0} is a
2247 percentage sign. (Note the comma before
2248 @code{url-coding-escape-character-code}; since CCL is a Lisp macro
2249 language, we can break out of the macro evaluation with a comma, and as
2250 such, ``@code{,url-coding-escape-character-code}'' will be evaluated as a
2251 literal `37'.)
2252
2253 If it is a percentage sign, we read the next two octets into @samp{r2}
2254 and @samp{r3}, and convert them into their hexadecimal numeric values,
2255 using the @code{url-coding-latin-1-as-hex-table} array declared above.
2256 (But again, it'll be interpreted as a literal array.) We then left
2257 shift the first by four bits, bitwise-OR the two together, and write the
2258 result to the output buffer.
2259
2260 If it isn't a percentage sign, and it is a `+' sign, we write a
2261 space--hexadecimal 20--to the output buffer.
2262
2263 If none of those things are true, we pass the octet to the output buffer
2264 untransformed. (This could be a place to put error checking, in a more
2265 expressive language.) We then read one more octet from the input
2266 buffer, and move to the next iteration of the loop.
2267
2268 @example
2269 (define-ccl-program ccl-decode-urlcoding
2270   `(1
2271     ((read r0)
2272      (loop
2273        (if (r0 == ,url-coding-escape-character-code)
2274            ((read r2 r3)
2275             ;; Replace r2 and r3 with the values found at those offsets
2276             ;; in url-coding-latin-1-as-hex-table.
2277             (r2 = r2 ,url-coding-latin-1-as-hex-table)
2278             (r3 = r3 ,url-coding-latin-1-as-hex-table)
2279             (r2 <<= 4)
2280             (r3 |= r2)
2281             (write r3))
2282          (if (r0 == ,url-coding-escaped-space-code)
2283              (write #x20)
2284            (write r0)))
2285        (read r0)
2286        (repeat))))
2287   "CCL program to take URI-encoded ASCII text and transform it to our
2288 internal encoding.")
2289 @end example
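
For readers more comfortable with Lisp than with CCL, the following is a
rough sketch of the same decoding logic written as an ordinary function
over a string. It is not part of the coding system described here: the
name @code{url-coding-sketch-decode} is hypothetical, and the sketch
converts both hex digits at once with @code{string-to-number} (assumed
to accept a base argument) instead of using the lookup table, shift and
OR that the CCL program uses.

@example
;; Hypothetical helper, for illustration only.
(defun url-coding-sketch-decode (str)
  "Decode the URL-encoded ASCII string STR, with no real error checking."
  (let ((i 0)
        (len (length str))
        (out ""))
    (while (< i len)
      (let ((ch (aref str i)))
        (cond
         ;; An escape: turn the next two hex digits into one octet.
         ((and (eq ch ?%) (<= (+ i 3) len))
          (setq out (concat out
                            (format "%c" (string-to-number
                                          (substring str (+ i 1) (+ i 3))
                                          16)))
                i (+ i 3)))
         ;; A plus sign stands for a space.
         ((eq ch ?+)
          (setq out (concat out " ")
                i (1+ i)))
         ;; Anything else passes through untransformed.
         (t
          (setq out (concat out (substring str i (1+ i)))
                i (1+ i))))))
    out))

(url-coding-sketch-decode "XEmacs+home%20page")   ; => "XEmacs home page"
@end example

The three branches of the @code{cond} correspond to the three cases the
CCL loop distinguishes: an escape, a plus sign, and everything else.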
2290
2291 @node The program to encode from internal format
2292 @subsubsection The program to encode from internal format
2293
2294 Next, we see the CCL program to encode ASCII text as URL coded text.
2295 Here, the buffer magnification is specified as three, to account for ` '
2296 mapping to %20, etc. As before, we read an octet from the input into
2297 @samp{r0}, and move into the body of the loop. Next, we check if we
2298 should preserve the value of this octet, by reading from offset
2299 @samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}.
2300 Then we have an @samp{if} statement predicated on the value in
2301 @samp{r1}; for the true branch, we write the input octet directly. For
2302 the false branch, we write a percentage sign, the ASCII encoding of the
2303 high four bits in hex, and then the ASCII encoding of the low four bits
2304 in hex.
2305
2306 We then read an octet from the input into @samp{r0}, and repeat the loop.
2307
2308 @example
2309 (define-ccl-program ccl-encode-urlcoding
2310   `(3
2311     ((read r0)
2312      (loop
2313        (r1 = r0 ,url-coding-should-preserve-table)
2314        ;; If we should preserve the value, just write the octet directly.
2315        (if r1
2316            (write r0)
2317          ;; else, write a percentage sign, and the hex value of the octet, in
2318          ;; an ASCII-friendly format.
2319          ((write ,url-coding-escape-character-code)
2320           (write r0 ,url-coding-high-order-nybble-as-ascii)
2321           (write r0 ,url-coding-low-order-nybble-as-ascii)))
2322        (read r0)
2323        (repeat))))
2324   "CCL program to encode octets (almost) according to RFC 1738")
2325 @end example
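
The encoder's logic can likewise be sketched in plain Lisp. As before,
this is only an illustration under stated assumptions, not the
implementation described in this section: the name
@code{url-coding-sketch-encode} is hypothetical, a regular expression
stands in for the preserve table, and @code{format} stands in for the
two nybble tables.

@example
;; Hypothetical helper, for illustration only.
(defun url-coding-sketch-encode (str)
  "URL-encode the ASCII string STR, one character at a time."
  (let ((i 0)
        (len (length str))
        (out ""))
    (while (< i len)
      (let* ((ch (aref str i))
             (one (substring str i (1+ i)))
             ;; char-int exists in XEmacs; where it is missing, we
             ;; assume characters are already integers.
             (code (if (fboundp 'char-int) (char-int ch) ch)))
        (setq out (concat out
                          ;; Alphanumerics, hyphen, underscore and full
                          ;; stop are preserved; the rest are escaped.
                          (if (string-match "[-_.A-Za-z0-9]" one)
                              one
                            (format "%%%02X" code)))
              i (1+ i))))
    out))

(url-coding-sketch-encode "home page.html")   ; => "home%20page.html"
@end example

Each input character yields either one or three output characters, which
is why the CCL program declares a buffer magnification of three.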
2060 2326
2061 @node Category Tables, Unicode Support, CCL, MULE 2327 @node Category Tables, Unicode Support, CCL, MULE
2062 @section Category Tables 2328 @section Category Tables
2063 2329
2064 A category table is a type of char table used for keeping track of 2330 A category table is a type of char table used for keeping track of