diff man/lispref/mule.texi @ 2640:a4040d921acc
[xemacs-hg @ 2005-03-09 05:36:28 by stephent]
internals and lispref <871xapfkkq.fsf@tleepslib.sk.tsukuba.ac.jp>
author | stephent |
---|---|
date | Wed, 09 Mar 2005 05:36:50 +0000 |
parents | ecf1ebac70d8 |
children | d5bfa26d5c3f |
--- a/man/lispref/mule.texi Wed Mar 09 04:59:31 2005 +0000
+++ b/man/lispref/mule.texi Wed Mar 09 05:36:50 2005 +0000
@@ -1765,7 +1765,7 @@
 * CCL Statements:: Semantics of CCL statements.
 * CCL Expressions:: Operators and expressions in CCL.
 * Calling CCL:: Running CCL programs.
-* CCL Examples:: The encoding functions for Big5 and KOI-8.
+* CCL Example:: A trivial program to transform the Web's URL encoding.
 @end menu
 
 @node CCL Syntax, CCL Statements, , CCL
@@ -1986,7 +1986,7 @@
 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
 represent the SJIS operations in infix form.
 
-@node Calling CCL, CCL Examples, CCL Expressions, CCL
+@node Calling CCL, CCL Example, CCL Expressions, CCL
 @comment Node, Next, Previous, Up
 @subsection Calling CCL
 
@@ -2052,11 +2052,277 @@
 Resets the CCL interpreter's internal elapsed time registers.
 @end defun
 
-@node CCL Examples, , Calling CCL, CCL
+@node CCL Example, , Calling CCL, CCL
 @comment Node, Next, Previous, Up
-@subsection CCL Examples
-
- This section is not yet written.
+@subsection CCL Example
+
+ In this section, we describe the implementation of a trivial coding
+system to transform from the Web's URL encoding to XEmacs' internal
+coding. Many people will have been first exposed to URL encoding when
+they saw ``%20'' where they expected a space in a file's name on their
+local hard disk; this can happen when a browser saves a file from the
+web and doesn't encode the name, as passed from the server, properly.
+
+ URL encoding itself is underspecified with regard to encodings beyond
+ASCII. The relevant document, RFC 1738, explicitly doesn't give any
+information on how to encode non-ASCII characters, and the ``obvious''
+way---use the %xx values for the octets of the eight bit MIME character
+set in which the page was served---breaks when a user types a character
+outside that character set. Best practice for web development is to
+serve all pages as UTF-8 and treat incoming form data as using that
+coding system. (Oh, and gamble that your clients won't ever want to
+type anything outside Unicode. But that's not so much of a gamble with
+today's client operating systems.) We don't treat non-ASCII in this
+example, as dealing with @samp{(read-multibyte-character ...)} and
+errors therewith would make it much harder to understand.
+
+ Since CCL isn't a very rich language, we move much of the logic that
+would ordinarily be computed from operations like @code{(member ...)},
+@code{(and ...)} and @code{(or ...)} into tables, from which register
+values are read and written, and on which @code{if} statements are
+predicated. Much more of the implementation of this coding system is
+occupied with constructing these tables---in normal Emacs Lisp---than it
+is with actual CCL code.
+
+ All the @code{defvar} statements we deal with in the next few sections
+are surrounded by a @code{(eval-and-compile ...)}, which means that the
+logic which initializes these variables executes at compile time, and if
+XEmacs loads the compiled version of the file, these variables are
+initialized as constants.
+
+@menu
+* Four bits to ASCII:: Two tables used for getting hex digits from ASCII.
+* URI Encoding constants:: Useful predefined characters.
+* Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL.
+* Characters to be preserved:: No transformation needed for these characters.
+* The program to decode to internal format:: .
+* The program to encode from internal format:: .
+
+@end menu
+
+@node Four bits to ASCII, URI Encoding constants, , CCL Example
+@subsubsection Four bits to ASCII
+
+ The first @code{defvar} is for
+@code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that
+maps from an octet's value to the ASCII encoding for the hex value of
+its most significant four bits. That might sound complex, but it isn't;
+for decimal 65, hex value @samp{#x41}, the entry in the table is the
+ASCII encoding of `4'. For decimal 122, ASCII `z', hex value
+@code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)}
+after this file is loaded gives the ASCII encoding of 7.
+
+@example
+(defvar url-coding-high-order-nybble-as-ascii
+  (let ((val (make-vector 256 0))
+        (i 0))
+    (while (< i (length val))
+      (aset val i (char-int (aref (format "%02X" i) 0)))
+      (setq i (1+ i)))
+    val)
+  "Table to find an ASCII version of an octet's most significant 4 bits.")
+@end example
+
+ The next table, @code{url-coding-low-order-nybble-as-ascii} is almost
+the same thing, but this time it has a map for the hex encoding of the
+low-order four bits. So the sixty-fifth entry (offset @samp{#x41}) is
+the ASCII encoding of `1', the hundred-and-twenty-second (offset
+@samp{#x7a}) is the ASCII encoding of `A'.
+
+@example
+(defvar url-coding-low-order-nybble-as-ascii
+  (let ((val (make-vector 256 0))
+        (i 0))
+    (while (< i (length val))
+      (aset val i (char-int (aref (format "%02X" i) 1)))
+      (setq i (1+ i)))
+    val)
+  "Table to find an ASCII version of an octet's least significant 4 bits.")
+@end example
+
+@node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example
+@subsubsection URI Encoding constants
+
+ Next, we have a couple of variables that make the CCL code more
+readable. The first is the ASCII encoding of the percentage sign; this
+character is used as an escape code, to start the encoding of a
+non-printable character. For historical reasons, URL encoding allows
+the space character to be encoded as a plus sign--it does make typing
+URLs like @samp{http://google.com/search?q=XEmacs+home+page} easier--and
+as such, we have to check when decoding for this value, and map it to
+the space character. When doing this in CCL, we use the
+@code{url-coding-escaped-space-code} variable.
+
+@example
+(defvar url-coding-escape-character-code (char-int ?%)
+  "The code point for the percentage sign, in ASCII.")
+
+(defvar url-coding-escaped-space-code (char-int ?+)
+  "The URL-encoded value of the space character, that is, +.")
+@end example
+
+@node Numeric to ASCII-hexadecimal conversion
+@subsubsection Numeric to ASCII-hexadecimal conversion
+
+ Now, we have a couple of utility tables that wouldn't be necessary in
+a more expressive programming language than is CCL. The first is sixteen
+in length, and maps a hexadecimal number to the ASCII encoding of that
+number; so zero maps to ASCII `0', ten maps to ASCII `A.' The second
+does the reverse; that is, it maps an ASCII character to its value when
+interpreted as a hexadecimal digit. ('A' => 10, 'c' => 12, '2' => 2, as
+a few examples.)
+
+@example
+(defvar url-coding-hex-digit-table
+  (let ((i 0)
+        (val (make-vector 16 0)))
+    (while (< i 16)
+      (aset val i (char-int (aref (format "%X" i) 0)))
+      (setq i (1+ i)))
+    val)
+  "A map from a hexadecimal digit's numeric value to its encoding in ASCII.")
+
+(defvar url-coding-latin-1-as-hex-table
+  (let ((val (make-vector 256 0))
+        (i 0))
+    (while (< i (length val))
+      ;; Get a hex val for this ASCII character.
+      (aset val i (string-to-int (format "%c" i) 16))
+      (setq i (1+ i)))
+    val)
+  "A map from Latin 1 code points to their values as hexadecimal digits.")
+@end example
+
+@node Characters to be preserved
+@subsubsection Characters to be preserved
+
+ And finally, the last of these tables. URL encoding says that
+alphanumeric characters, the underscore, hyphen and the full stop
+@footnote{That's what the standards call it, though my North American
+readers will be more familiar with it as the period character.} retain
+their ASCII encoding, and don't undergo transformation.
+@code{url-coding-should-preserve-table} is an array in which the entries
+are one if the corresponding ASCII character should be left as-is, and
+zero if they should be transformed. So the entries for all the control
+and most of the punctuation characters are zero. Lisp programmers will
+observe that this initialization is particularly inefficient, but
+they'll also be aware that this is a long way from an inner loop where
+every nanosecond counts.
+
+@example
+(defvar url-coding-should-preserve-table
+  (let ((preserve
+         (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o
+               ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G
+               ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y
+               ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9))
+        (i 0)
+        (res (make-vector 256 0)))
+    (while (< i 256)
+      (when (member (int-char i) preserve)
+        (aset res i 1))
+      (setq i (1+ i)))
+    res)
+  "A 256-entry array of flags, indicating whether or not to preserve an
+octet as its ASCII encoding.")
+@end example
+
+@node The program to decode to internal format
+@subsubsection The program to decode to internal format
+
+ After the almost interminable tables, we get to the CCL. The first
+CCL program, @code{ccl-decode-urlcoding} decodes from the URL coding to
+our internal format; since this version of CCL doesn't have support for
+error checking on the input, we don't do any verification on it.
+
+The buffer magnification--approximate ratio of the size of the output
+buffer to the size of the input buffer--is declared as one, because
+fractional values aren't allowed. (Since all those %20's will map to
+` ', the length of the output text will be less than that of the input
+text.)
+
+So, first we read an octet from the input buffer into register
+@samp{r0}, to set up the loop. Next, we start the loop, with a
+@code{(loop ...)} statement, and we check if the value in @samp{r0} is a
+percentage sign. (Note the comma before
+@code{url-coding-escape-character-code}; since CCL is a Lisp macro
+language, we can break out of the macro evaluation with a comma, and as
+such, ``@code{,url-coding-escape-character-code}'' will be evaluated as a
+literal `37.')
+
+If it is a percentage sign, we read the next two octets into @samp{r2}
+and @samp{r3}, and convert them into their hexadecimal numeric values,
+using the @code{url-coding-latin-1-as-hex-table} array declared above.
+(But again, it'll be interpreted as a literal array.) We then left
+shift the first by four bits, OR the two together, and write the
+result to the output buffer.
+
+If it isn't a percentage sign, and it is a `+' sign, we write a
+space--hexadecimal 20--to the output buffer.
+
+If none of those things are true, we pass the octet to the output buffer
+untransformed. (This could be a place to put error checking, in a more
+expressive language.) We then read one more octet from the input
+buffer, and move to the next iteration of the loop.
+
+@example
+(define-ccl-program ccl-decode-urlcoding
+  `(1
+    ((read r0)
+     (loop
+       (if (r0 == ,url-coding-escape-character-code)
+           ((read r2 r3)
+            ;; Look up r2 and r3 in url-coding-latin-1-as-hex-table, replacing
+            ;; each ASCII hex digit with its numeric value.
+            (r2 = r2 ,url-coding-latin-1-as-hex-table)
+            (r3 = r3 ,url-coding-latin-1-as-hex-table)
+            (r2 <<= 4)
+            (r3 |= r2)
+            (write r3))
+         (if (r0 == ,url-coding-escaped-space-code)
+             (write #x20)
+           (write r0)))
+       (read r0)
+       (repeat))))
+  "CCL program to take URI-encoded ASCII text and transform it to our
+internal encoding. ")
+@end example
+
+@node The program to encode from internal format
+@subsubsection The program to encode from internal format
+
+ Next, we see the CCL program to encode ASCII text as URL coded text.
+Here, the buffer magnification is specified as three, to account for ` '
+mapping to %20, etc. As before, we read an octet from the input into
+@samp{r0}, and move into the body of the loop. Next, we check if we
+should preserve the value of this octet, by reading from offset
+@samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}.
+Then we have an @samp{if} statement predicated on the value in
+@samp{r1}; for the true branch, we write the input octet directly. For
+the false branch, we write a percentage sign, the ASCII encoding of the
+high four bits in hex, and then the ASCII encoding of the low four bits
+in hex.
+
+We then read an octet from the input into @samp{r0}, and repeat the loop.
+
+@example
+(define-ccl-program ccl-encode-urlcoding
+  `(3
+    ((read r0)
+     (loop
+       (r1 = r0 ,url-coding-should-preserve-table)
+       ;; If we should preserve the value, just write the octet directly.
+       (if r1
+           (write r0)
+         ;; else, write a percentage sign, and the hex value of the octet, in
+         ;; an ASCII-friendly format.
+         ((write ,url-coding-escape-character-code)
+          (write r0 ,url-coding-high-order-nybble-as-ascii)
+          (write r0 ,url-coding-low-order-nybble-as-ascii)))
+       (read r0)
+       (repeat))))
+  "CCL program to encode octets (almost) according to RFC 1738")
+@end example
 
 @node Category Tables, Unicode Support, CCL, MULE
 @section Category Tables