diff man/internals/internals.texi @ 318:afd57c14dfc8 r21-0b57

Import from CVS: tag r21-0b57
author cvs
date Mon, 13 Aug 2007 10:45:36 +0200
parents 70ad99077275
children 8bec6624d99b
--- a/man/internals/internals.texi	Mon Aug 13 10:44:47 2007 +0200
+++ b/man/internals/internals.texi	Mon Aug 13 10:45:36 2007 +0200
@@ -598,6 +598,8 @@
 version 20.1 released September 17, 1997.
 @item
 version 20.2 released September 20, 1997.
+@item
+version 20.3 released August 19, 1998.
 @end itemize
 
 @node XEmacs
@@ -1654,6 +1656,7 @@
 * General Coding Rules::
 * Writing Lisp Primitives::
 * Adding Global Lisp Variables::
+* Coding for Mule::
 * Techniques for XEmacs Developers::
 @end menu
 
@@ -1754,7 +1757,7 @@
     @{
       val = Feval (XCAR (args));
       if (!NILP (val))
-	break;
+        break;
       args = XCDR (args);
     @}
 
@@ -2024,6 +2027,352 @@
 Lisp object, and you will be the one who's unhappy when you can't figure
 out how your variable got overwritten.
 
+@node Coding for Mule
+@section Coding for Mule
+@cindex Coding for Mule
+
+Although Mule support is not compiled by default in XEmacs, many people
+are using it, and we consider it crucial that new code works correctly
+with multibyte characters.  This is not hard; it is only a matter of
+following several simple coding guidelines.  Even if you never
+compile with Mule, with a little practice you will find it quite easy
+to code Mule-correctly.
+
+Note that these guidelines are not necessarily tied to the current Mule
+implementation; they are also a good idea to follow on the grounds of
+code generalization for future I18N work.
+
+@menu
+* Character-Related Data Types::
+* Working With Character and Byte Positions::
+* Conversion of External Data::
+* General Guidelines for Writing Mule-Aware Code::
+* An Example of Mule-Aware Code::
+@end menu
+
+@node Character-Related Data Types
+@subsection Character-Related Data Types
+
+First, we will list the basic character-related datatypes used by
+XEmacs.  Note that the separate @code{typedef}s are not required for the 
+code to work (all of them boil down to @code{unsigned char} or
+@code{int}), but they improve clarity of code a great deal, because one
+glance at the declaration can tell the intended use of the variable.
+
+@table @code
+@item Emchar
+@cindex Emchar
+An @code{Emchar} holds a single Emacs character.
+
+Obviously, the equality between characters and bytes is lost in the Mule
+world.  Characters can be represented by one or more bytes in the
+buffer, and @code{Emchar} is the C type large enough to hold any
+character.
+
+Without Mule support, an @code{Emchar} is equivalent to an
+@code{unsigned char}.
+
+@item Bufbyte
+@cindex Bufbyte
+The data representing the text in a buffer or string is logically a set
+of @code{Bufbyte}s.
+
+XEmacs does not work with character formats all the time; when reading
+characters from the outside, it decodes them to an internal format, and
+likewise encodes them when writing.  @code{Bufbyte} (in fact
+@code{unsigned char}) is the basic unit of the internal format used by
+XEmacs buffers and strings.
+
+One character can correspond to one or more @code{Bufbyte}s.  In the
+current implementation, an ASCII character is represented by a single
+@code{Bufbyte} of the same value, and extended characters are
+represented by a sequence of @code{Bufbyte}s.
+
+Without Mule support, a @code{Bufbyte} is equivalent to an
+@code{Emchar}.
+
+@item Bufpos
+@itemx Charcount
+A @code{Bufpos} represents a character position in a buffer or string.
+A @code{Charcount} represents a number (count) of characters.
+Logically, subtracting two @code{Bufpos} values yields a
+@code{Charcount} value.  Although all of these are @code{typedef}ed to
+@code{int}, we use them in preference to @code{int} to make it clear
+what sort of position is being used.
+
+@code{Bufpos} and @code{Charcount} values are the only ones that are
+ever visible to Lisp.
+
+@item Bytind
+@itemx Bytecount
+A @code{Bytind} represents a byte position in a buffer or string.  A
+@code{Bytecount} represents the distance between two positions in bytes.
+The relationship between @code{Bytind} and @code{Bytecount} is the same
+as the relationship between @code{Bufpos} and @code{Charcount}.
+
+@item Extbyte
+@itemx Extcount
+When dealing with the outside world, XEmacs works with @code{Extbyte}s,
+which are equivalent to @code{unsigned char}.  Obviously, an
+@code{Extcount} is the distance between two @code{Extbyte}s.  Extbytes
+and Extcounts are not all that frequent in XEmacs code.
+@end table
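The following is a minimal standalone sketch of how the distinct typedefs document intent, assuming (as stated above) that they all reduce to @code{int} or @code{unsigned char}.  The typedefs here are stand-ins written for illustration, not copies of the real XEmacs definitions, and the two @code{demo_} helpers are hypothetical.

```c
#include <assert.h>

/* Illustrative stand-ins only; the real definitions live in the XEmacs
   sources and may differ in detail. */
typedef int           Emchar;     /* one whole character */
typedef unsigned char Bufbyte;    /* one byte of internal text */
typedef int           Bufpos;     /* character position */
typedef int           Charcount;  /* number of characters */
typedef int           Bytind;     /* byte position */
typedef int           Bytecount;  /* number of bytes */

/* Hypothetical helper: subtracting two character positions yields a
   character count. */
Charcount
demo_chars_between (Bufpos from, Bufpos to)
{
  assert (from <= to);
  return to - from;
}

/* Hypothetical helper: subtracting two byte positions yields a byte
   count.  The arithmetic is identical, but the signature tells the
   reader which unit is meant. */
Bytecount
demo_bytes_between (Bytind from, Bytind to)
{
  assert (from <= to);
  return to - from;
}
```

The compiler treats all of these as plain integers, so nothing stops a @code{Bytecount} from being passed where a @code{Charcount} is expected; the benefit is purely in readability and review.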
+
+@node Working With Character and Byte Positions
+@subsection Working With Character and Byte Positions
+
+Now that we have defined the basic character-related types, we can look
+at the macros and functions designed for work with them and for
+conversion between them.  Most of these macros are defined in
+@file{buffer.h}, and we don't discuss all of them here, but only the
+most important ones.  Examining the existing code is the best way to
+learn about them.
+
+@table @code
+@item MAX_EMCHAR_LEN
+This preprocessor constant is the maximum number of buffer bytes per
+Emacs character, i.e. the largest number of bytes a single @code{Emchar}
+can occupy in the internal format.  It is useful when allocating
+temporary strings that must hold a known number of characters.
+For instance:
+
+@example
+@group
+@{
+  Charcount cclen;
+  ...
+  @{
+    /* Allocate place for @var{cclen} characters. */
+    Bufbyte *tmp_buf = (Bufbyte *)alloca (cclen * MAX_EMCHAR_LEN);
+...
+@end group
+@end example
+
+If you followed the previous section, you can guess that, logically,
+multiplying a @code{Charcount} value by @code{MAX_EMCHAR_LEN} produces
+a @code{Bytecount} value.
+
+In the current Mule implementation, @code{MAX_EMCHAR_LEN} equals 4.
+Without Mule, it is 1.
+
+@item charptr_emchar
+@itemx set_charptr_emchar
+The @code{charptr_emchar} macro takes a @code{Bufbyte} pointer and returns
+the underlying @code{Emchar}.  If it were a function, its prototype
+would be:
+
+@example
+Emchar charptr_emchar (Bufbyte *p);
+@end example
+
+@code{set_charptr_emchar} stores an @code{Emchar} to the specified byte
+position.  It returns the number of bytes stored:
+
+@example
+Bytecount set_charptr_emchar (Bufbyte *p, Emchar c);
+@end example
+
+It is important to note that @code{set_charptr_emchar} is safe only for
+appending a character at the end of a buffer, not for overwriting a
+character in the middle.  This is because the width of characters
+varies, and @code{set_charptr_emchar} cannot resize the string if it
+writes, say, a two-byte character where a single-byte character used to
+reside.
+
+A typical use of @code{set_charptr_emchar} can be demonstrated by this
+example, which copies characters from buffer @var{buf} to a temporary
+string of Bufbytes.
+
+@example
+@group
+@{
+  Bufpos pos;
+  for (pos = beg; pos < end; pos++)
+    @{
+      Emchar c = BUF_FETCH_CHAR (buf, pos);
+      p += set_charptr_emchar (p, c);
+    @}
+@}
+@end group
+@end example
+
+Note how @code{set_charptr_emchar} is used to store the @code{Emchar}
+and advance the pointer @code{p} at the same time.
+
+@item INC_CHARPTR
+@itemx DEC_CHARPTR
+These two macros increment and decrement a @code{Bufbyte} pointer,
+respectively.  The pointer needs to be correctly positioned at the
+beginning of a valid character position.
+
+Without Mule support, @code{INC_CHARPTR (p)} and @code{DEC_CHARPTR (p)}
+simply expand to @code{p++} and @code{p--}, respectively.
+
+@item bytecount_to_charcount
+Given a pointer to a text string and a length in bytes, return the
+equivalent length in characters.
+
+@example
+Charcount bytecount_to_charcount (Bufbyte *p, Bytecount bc);
+@end example
+
+@item charcount_to_bytecount
+Given a pointer to a text string and a length in characters, return the
+equivalent length in bytes.
+
+@example
+Bytecount charcount_to_bytecount (Bufbyte *p, Charcount cc);
+@end example
+
+@item charptr_n_addr
+Return a pointer to the beginning of the character offset @var{cc} (in
+characters) from @var{p}.
+
+@example
+Bufbyte *charptr_n_addr (Bufbyte *p, Charcount cc);
+@end example
+@end table
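The relationship between byte counts and character counts can be sketched outside XEmacs as follows.  This is not the Mule implementation: it uses UTF-8 as a stand-in for a variable-width internal format, and every @code{demo_} name is hypothetical.  Like @code{INC_CHARPTR}, the helpers assume the pointer sits at the beginning of a valid character.

```c
#include <assert.h>

typedef unsigned char Bufbyte;   /* stand-in typedefs, for illustration */
typedef int Charcount;
typedef int Bytecount;

/* Byte length of the character starting at P.  Valid UTF-8 only; UTF-8
   is a stand-in here, not the Mule internal format. */
Bytecount
demo_char_len (const Bufbyte *p)
{
  if (*p < 0x80)           return 1;
  if ((*p & 0xE0) == 0xC0) return 2;
  if ((*p & 0xF0) == 0xE0) return 3;
  return 4;
}

/* Analogue of INC_CHARPTR: step over one whole character. */
const Bufbyte *
demo_inc_charptr (const Bufbyte *p)
{
  return p + demo_char_len (p);
}

/* Number of characters in the BC bytes starting at P. */
Charcount
demo_bytecount_to_charcount (const Bufbyte *p, Bytecount bc)
{
  const Bufbyte *end = p + bc;
  Charcount cc = 0;

  assert (bc >= 0);
  while (p < end)
    {
      p = demo_inc_charptr (p);
      cc++;
    }
  return cc;
}

/* Number of bytes occupied by the first CC characters starting at P. */
Bytecount
demo_charcount_to_bytecount (const Bufbyte *p, Charcount cc)
{
  const Bufbyte *start = p;

  while (cc-- > 0)
    p = demo_inc_charptr (p);
  return (Bytecount) (p - start);
}

/* Pointer to the character CC characters after P. */
const Bufbyte *
demo_charptr_n_addr (const Bufbyte *p, Charcount cc)
{
  return p + demo_charcount_to_bytecount (p, cc);
}
```

Both conversions walk the text one character at a time, so they cost time proportional to the length converted; this is one reason to keep track of which unit a variable holds rather than converting repeatedly.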
+
+@node Conversion of External Data
+@subsection Conversion of External Data
+
+When an external function, such as a C library function, returns a
+@code{char} pointer, you should never treat it as @code{Bufbyte}.  This
+is because these returned strings may contain 8-bit characters which can
+be misinterpreted by XEmacs, and cause a crash.  Instead, you should use
+a conversion macro.  Many different conversion macros are defined in
+@file{buffer.h}, so I will try to order them logically, by direction and
+by format.
+
+The basic conversion macros are @code{GET_CHARPTR_INT_DATA_ALLOCA}
+and @code{GET_CHARPTR_EXT_DATA_ALLOCA}.  The former is used to convert
+external data to internal format, and the latter is used to convert the
+other way around.  The arguments each of these receives are @var{ptr}
+(pointer to the text to be converted), @var{len} (length of that text in
+bytes), @var{fmt} (format of the external text), @var{ptr_out} (lvalue
+to which new text should be copied), and @var{len_out} (lvalue which
+will be assigned the length of the converted text in bytes).  The
+resulting text is stored to a stack-allocated buffer.  If the text
+doesn't need changing, these macros will do nothing, except for setting
+@var{len_out}.
+
+Currently meaningful formats are @code{FORMAT_BINARY},
+@code{FORMAT_FILENAME}, @code{FORMAT_OS}, and @code{FORMAT_CTEXT}.
+
+The two macros above take many arguments, which makes them unwieldy.  For
+this reason, several convenience macros are defined with obvious
+functionality, but accepting fewer arguments:
+
+@table @code
+@item GET_C_CHARPTR_EXT_DATA_ALLOCA
+@itemx GET_C_CHARPTR_INT_DATA_ALLOCA
+These two macros work on ``C char pointers'', which are zero-terminated, 
+and thus do not need @var{len} or @var{len_out} parameters.
+
+@item GET_STRING_EXT_DATA_ALLOCA
+@itemx GET_C_STRING_EXT_DATA_ALLOCA
+These two macros work on Lisp strings, thus also not needing a @var{len}
+parameter.  However, @code{GET_STRING_EXT_DATA_ALLOCA} still provides a
+@var{len_out} parameter.  Note that for Lisp strings only one conversion
+direction makes sense.
+
+@item GET_C_CHARPTR_EXT_BINARY_DATA_ALLOCA
+@itemx GET_C_CHARPTR_EXT_FILENAME_DATA_ALLOCA
+@itemx GET_C_CHARPTR_EXT_CTEXT_DATA_ALLOCA
+@itemx ...
+These macros are a combination of the above, but with the @var{fmt}
+argument encoded into the name of the macro.
+@end table
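To illustrate why external bytes cannot simply be copied into internal text, here is a small standalone sketch of a decoding step.  It does not use the XEmacs macros: it assumes the external data is Latin-1 and uses UTF-8 as a stand-in internal format, and @code{demo_decode_external} and @code{DEMO_MAX_BYTES_PER_EXTBYTE} are hypothetical names.  The parameters mirror the @var{ptr}, @var{len}, @var{ptr_out} and @var{len_out} roles described above, except that the caller supplies the output buffer instead of having it allocated on the stack.

```c
#include <assert.h>

typedef unsigned char Extbyte;   /* stand-in typedefs, for illustration */
typedef unsigned char Bufbyte;
typedef int Extcount;
typedef int Bytecount;

/* Worst case for this stand-in encoding: each external byte becomes two
   internal bytes. */
#define DEMO_MAX_BYTES_PER_EXTBYTE 2

/* Decode LEN Latin-1 bytes at PTR into OUT, which must have room for
   LEN * DEMO_MAX_BYTES_PER_EXTBYTE bytes.  The number of bytes written
   is stored in *LEN_OUT. */
void
demo_decode_external (const Extbyte *ptr, Extcount len,
                      Bufbyte *out, Bytecount *len_out)
{
  Bufbyte *p = out;
  Extcount i;

  assert (len >= 0);
  for (i = 0; i < len; i++)
    {
      Extbyte b = ptr[i];
      if (b < 0x80)
        *p++ = b;                              /* ASCII passes through */
      else
        {
          /* An 8-bit byte needs a two-byte internal sequence. */
          *p++ = (Bufbyte) (0xC0 | (b >> 6));
          *p++ = (Bufbyte) (0x80 | (b & 0x3F));
        }
    }
  *len_out = (Bytecount) (p - out);
}
```

Note that the output length differs from the input length as soon as a non-ASCII byte appears, which is why the real macros report the converted length through @var{len_out}.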
+
+@node General Guidelines for Writing Mule-Aware Code
+@subsection General Guidelines for Writing Mule-Aware Code
+
+This section contains some general guidance on how to write Mule-aware
+code, as well as some pitfalls you should avoid.
+
+@table @emph
+@item Never use @code{char} and @code{char *}.
+In XEmacs, the use of @code{char} and @code{char *} is almost always a
+mistake.  If you want to manipulate an Emacs character from ``C'', use
+@code{Emchar}.  If you want to examine a specific octet in the internal
+format, use @code{Bufbyte}.  If you want a Lisp-visible character, use a
+@code{Lisp_Object} and @code{make_char}.  If you want a pointer to move
+through the internal text, use @code{Bufbyte *}.  Also note that you
+almost certainly do not need @code{Emchar *}.
+
+@item Be careful not to confuse @code{Charcount}, @code{Bytecount}, and @code{Bufpos}.
+The whole point of using different types is to avoid confusion about the 
+use of certain variables.  Lest this effect be nullified, you need to be 
+careful about using the right types.
+
+@item Always convert external data
+It is extremely important to always convert external data, because
+XEmacs can crash if unexpected 8-bit sequences are copied to its internal
+buffers literally.
+
+This means that when a system function, such as @code{readdir}, returns
+a string, you need to convert it using one of the conversion macros
+described in the previous section, before passing it further to Lisp.
+In the case of @code{readdir}, you would use the
+@code{GET_C_CHARPTR_INT_FILENAME_DATA_ALLOCA} macro.
+
+Also note that many internal functions, such as @code{make_string},
+accept Bufbytes, which removes the need for them to convert the data
+they receive.  This increases efficiency because that way external data
+needs to be decoded only once, when it is read.  After that, it is
+passed around in internal format.
+@end table
+
+@node An Example of Mule-Aware Code
+@subsection An Example of Mule-Aware Code
+
+As an example of Mule-aware code, we will analyze the
+@code{string} function, which conses up a Lisp string from the character
+arguments it receives.  Here is the definition, pasted from
+@file{alloc.c}:
+
+@example
+@group
+DEFUN ("string", Fstring, 0, MANY, 0, /*
+Concatenate all the argument characters and make the result a string.
+*/
+       (int nargs, Lisp_Object *args))
+@{
+  Bufbyte *storage = alloca_array (Bufbyte, nargs * MAX_EMCHAR_LEN);
+  Bufbyte *p = storage;
+
+  for (; nargs; nargs--, args++)
+    @{
+      Lisp_Object lisp_char = *args;
+      CHECK_CHAR_COERCE_INT (lisp_char);
+      p += set_charptr_emchar (p, XCHAR (lisp_char));
+    @}
+  return make_string (storage, p - storage);
+@}
+@end group
+@end example
+
+Now we can analyze the source line by line.
+
+The resulting string contains exactly as many characters as there are
+arguments to the function.  This is why we allocate
+@code{MAX_EMCHAR_LEN} * @var{nargs} bytes on the stack, i.e. the
+worst-case number of bytes needed to hold @var{nargs} @code{Emchar}s.
+
+Then, the loop checks that each element is a character, converting
+integers in the process.  Like many other functions in XEmacs, this
+function silently accepts integers where characters are expected, for
+historical and compatibility reasons.  Unless you know what you are
+doing, @code{CHECK_CHAR} will also suffice.  @code{XCHAR (lisp_char)}
+extracts the @code{Emchar} from the @code{Lisp_Object}, and
+@code{set_charptr_emchar} stores it to @code{storage}, increasing @code{p} in
+the process.
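The same allocate-for-the-worst-case, then append pattern can be tried outside XEmacs with the sketch below.  It is an analogy rather than the real code: UTF-8 stands in for the internal format, @code{malloc} replaces the stack allocation, and @code{demo_set_charptr_emchar}, @code{demo_string} and @code{DEMO_MAX_EMCHAR_LEN} are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef int           Emchar;    /* stand-in typedefs, for illustration */
typedef unsigned char Bufbyte;
typedef int           Bytecount;

#define DEMO_MAX_EMCHAR_LEN 4    /* worst case for the stand-in encoding */

/* Store C at P using UTF-8 as a stand-in encoding and return the number
   of bytes written.  Surrogate code points are not rejected. */
Bytecount
demo_set_charptr_emchar (Bufbyte *p, Emchar c)
{
  assert (c >= 0 && c <= 0x10FFFF);
  if (c < 0x80)
    {
      p[0] = (Bufbyte) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (Bufbyte) (0xC0 | (c >> 6));
      p[1] = (Bufbyte) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (Bufbyte) (0xE0 | (c >> 12));
      p[1] = (Bufbyte) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (Bufbyte) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (Bufbyte) (0xF0 | (c >> 18));
  p[1] = (Bufbyte) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (Bufbyte) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (Bufbyte) (0x80 | (c & 0x3F));
  return 4;
}

/* Build internal text from NARGS characters.  Returns a malloc'ed buffer
   (the caller frees it) and stores its byte length in *LEN_OUT; returns
   NULL if allocation fails. */
Bufbyte *
demo_string (const Emchar *chars, int nargs, Bytecount *len_out)
{
  /* Worst case: every character needs DEMO_MAX_EMCHAR_LEN bytes.  The
     extra byte avoids a zero-sized allocation when NARGS is 0. */
  Bufbyte *storage = malloc ((size_t) nargs * DEMO_MAX_EMCHAR_LEN + 1);
  Bufbyte *p = storage;
  int i;

  if (storage == NULL)
    return NULL;
  for (i = 0; i < nargs; i++)
    p += demo_set_charptr_emchar (p, chars[i]);  /* store and advance */
  *len_out = (Bytecount) (p - storage);
  return storage;
}
```

The actual length, @code{p - storage}, is usually smaller than the allocation; only appending makes this safe, because each character is written into space that has never held a character of a different width.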
+
+Other instructive examples of correct coding under Mule can be found all
+over XEmacs code.  For starters, I recommend
+@code{Fnormalize_menu_item_name} in @file{menubar.c}.  After you have
+understood this section of the manual and studied the examples, you can
+proceed to write new Mule-aware code.
+
 @node Techniques for XEmacs Developers
 @section Techniques for XEmacs Developers