xemacs-beta: man/internals/internals.texi comparison

comparison man/internals/internals.texi @ 1261:465bd3c7d932

[xemacs-hg @ 2003-02-06 06:35:47 by ben] various bug fixes mule/cyril-util.el: Fix compile warning. loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set stack-trace-on-error, load-always-display-messages so we get better debug results. update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure. simple.el: Omit M-S-home/end from motion keys. update-elc.el: Overhaul: -- allow list of "early-compile" files to be specified, not hardcoded -- fix autoload checking to include all .el files, not just dumped ones -- be smarter about regenerating autoloads, so we don't need to use loadup-el if not necessary -- use standard methods for loading/not loading auto-autoloads.el (maybe fixes "Already loaded" error?) -- rename misleading NOBYTECOMPILE flag file. window-xemacs.el: Fix bug in default param. window-xemacs.el: Fix compile warnings. lwlib-Xm.c: Fix compile warning. lispref/mule.texi: Lots of Mule rewriting. internals/internals.texi: Major fixup. Correct for new names of Bytebpos, Ichar, etc. and lots of Mule rewriting. config.inc.samp: Various fixups. Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE. esd.c: Warning fixes. fns.c: Eliminate bogus require-prints-loading-message; use already existent load-always-display-messages instead. Make sure `load' knows we are coming from `require'. lread.c: Turn on `load-warn-when-source-newer' by default. Change loading message to indicate when we are `require'ing. Eliminate purify_flag hacks to display more messages; instead, loadup and friends specify this explicitly with `load-always-display-messages'. Add spaces when batch to clearly indicate recursive loading. Fassoc() does not GC so no need to gcpro. gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar items due to lack of GCPROing of callbacks in lwlib structures. 
eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when noninteractive, or backtraces get all screwed up as some values are printed through the stream console and some aren't. Export canonicalize_printcharfun() and use in Fbacktrace().

author	ben
date	Thu, 06 Feb 2003 06:36:17 +0000
parents	c1553814932e
children	bada4b0bce3a


-:278c9cd3435e
+:465bd3c7d932
 * Introduction to Buffers::     A buffer holds a block of text such as a file.
 * The Text in a Buffer::        Representation of the text in a buffer.
 * Buffer Lists::                Keeping track of all buffers.
 * Markers and Extents::         Tagging locations within a buffer.
-* Bufbytes and Emchars::        Representation of individual characters.
+* Ibytes and Ichars::           Representation of individual characters.
 * The Buffer Object::           The Lisp object corresponding to a buffer.
 MULE Character Sets and Encodings
 * Character Sets::
 * Character-Related Data Types::
 * Working With Character and Byte Positions::
 * Conversion to and from External Data::
 * General Guidelines for Writing Mule-Aware Code::
 * An Example of Mule-Aware Code::
+* Mule-izing Code::
 @end menu
 @node Character-Related Data Types
 @subsection Character-Related Data Types
 @cindex character-related data types
 @cindex data types, character-related
 First, let's review the basic character-related datatypes used by
-XEmacs.  Note that the separate @code{typedef}s are not mandatory in the
+XEmacs.  Note that some of the separate @code{typedef}s are not
-current implementation (all of them boil down to @code{unsigned char} or
+mandatory, but they improve clarity of code a great deal, because one
-@code{int}), but they improve clarity of code a great deal, because one
 glance at the declaration can tell the intended use of the variable.
 @table @code
-@item Emchar
+@item Ichar
-@cindex Emchar
+@cindex Ichar
-An @code{Emchar} holds a single Emacs character.
+An @code{Ichar} holds a single Emacs character.
 Obviously, the equality between characters and bytes is lost in the Mule
 world.  Characters can be represented by one or more bytes in the
-buffer, and @code{Emchar} is the C type large enough to hold any
+buffer, and @code{Ichar} is the C type large enough to hold any
 character.
-Without Mule support, an @code{Emchar} is equivalent to an
+Without Mule support, an @code{Ichar} is equivalent to an
 @code{unsigned char}.
-@item Bufbyte
+@item Ibyte
-@cindex Bufbyte
+@cindex Ibyte
 The data representing the text in a buffer or string is logically a set
-of @code{Bufbyte}s.
+of @code{Ibyte}s.
 XEmacs does not work with the same character formats all the time; when
 reading characters from the outside, it decodes them to an internal
-format, and likewise encodes them when writing.  @code{Bufbyte} (in fact
+format, and likewise encodes them when writing.  @code{Ibyte} (in fact
 @code{unsigned char}) is the basic unit of XEmacs internal buffers and
-strings format.  A @code{Bufbyte *} is the type that points at text
+strings format.  An @code{Ibyte *} is the type that points at text
 encoded in the variable-width internal encoding.
-One character can correspond to one or more @code{Bufbyte}s.  In the
+One character can correspond to one or more @code{Ibyte}s.  In the
 current Mule implementation, an ASCII character is represented by the
-same @code{Bufbyte}, and other characters are represented by a sequence
+same @code{Ibyte}, and other characters are represented by a sequence
-of two or more @code{Bufbyte}s.
+of two or more @code{Ibyte}s.
 Without Mule support, there are exactly 256 characters, implicitly
-Latin-1, and each character is represented using one @code{Bufbyte}, and
+Latin-1, and each character is represented using one @code{Ibyte}, and
-there is a one-to-one correspondence between @code{Bufbyte}s and
+there is a one-to-one correspondence between @code{Ibyte}s and
-@code{Emchar}s.
+@code{Ichar}s.
-@item Bufpos
+@item Charxpos
+@itemx Charbpos
 @itemx Charcount
-@cindex Bufpos
+@cindex Charxpos
+@cindex Charbpos
 @cindex Charcount
-A @code{Bufpos} represents a character position in a buffer or string.
+A @code{Charbpos} represents a character position in a buffer.  A
-A @code{Charcount} represents a number (count) of characters.
+@code{Charcount} represents a number (count) of characters.  Logically,
-Logically, subtracting two @code{Bufpos} values yields a
+subtracting two @code{Charbpos} values yields a @code{Charcount} value.
-@code{Charcount} value.  Although all of these are @code{typedef}ed to
+When representing a character position in a string, we just use
+@code{Charcount} directly.  The reason for having a separate typedef for
+buffer positions is that they are 1-based, whereas string positions are
+0-based and hence string counts and positions can be freely intermixed (a
+string position is equivalent to the count of characters from the
+beginning).  When representing a character position that could be either
+in a buffer or string (for example, in the extent code), @code{Charxpos}
+is used.  Although all of these are @code{typedef}ed to
 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make
 it clear what sort of position is being used.
-@code{Bufpos} and @code{Charcount} values are the only ones that are
+@code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the
-ever visible to Lisp.
+only ones that are ever visible to Lisp.
-@item Bytind
+@item Bytexpos
 @itemx Bytecount
-@cindex Bytind
+@cindex Bytebpos
 @cindex Bytecount
-A @code{Bytind} represents a byte position in a buffer or string.  A
+A @code{Bytebpos} represents a byte position in a buffer.  A
-@code{Bytecount} represents the distance between two positions, in bytes.
+@code{Bytecount} represents the distance between two positions, in
-The relationship between @code{Bytind} and @code{Bytecount} is the same
+bytes.  Byte positions in strings use @code{Bytecount}, and for byte
-as the relationship between @code{Bufpos} and @code{Charcount}.
+positions that can be either in a buffer or string, @code{Bytexpos} is
+used.  The relationship between @code{Bytexpos}, @code{Bytebpos} and
+@code{Bytecount} is the same as the relationship between
+@code{Charxpos}, @code{Charbpos} and @code{Charcount}.
 @item Extbyte
-@itemx Extcount
 @cindex Extbyte
-@cindex Extcount
 When dealing with the outside world, XEmacs works with @code{Extbyte}s,
-which are equivalent to @code{unsigned char}.  Obviously, an
+which are equivalent to @code{char}.  The distance between two
-@code{Extcount} is the distance between two @code{Extbyte}s.  Extbytes
+@code{Extbyte}s is a @code{Bytecount}, since external text is a
-and Extcounts are not all that frequent in XEmacs code.
+byte-by-byte encoding.  Extbytes occur mainly at the transition point
+between internal text and external functions.  XEmacs code should not,
+if it can possibly avoid it, do any actual manipulation using external
+text, since its format is completely unpredictable (it might not even be
+ASCII-compatible).
 @end table
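The relationship between buffer positions and counts described above can be sketched in standalone C.  This is only an illustration: the typedefs are local stand-ins (declared as @code{long} here rather than @code{EMACS_INT}), and the helpers @code{charbpos_diff} and @code{charbpos_from_count} are hypothetical names, not XEmacs functions.

```c
#include <assert.h>

/* Stand-in typedefs (assumption: the real ones are EMACS_INT). */
typedef long Charcount;  /* a number of characters; also a 0-based string position */
typedef long Charbpos;   /* a 1-based character position in a buffer */

/* Hypothetical helper: subtracting two buffer positions yields a count. */
static Charcount
charbpos_diff (Charbpos end, Charbpos beg)
{
  assert (beg >= 1 && end >= beg);   /* buffer positions start at 1 */
  return end - beg;
}

/* Hypothetical helper: the buffer position lying COUNT characters after
   the start of a buffer.  The "+ 1" is the 1-based offset that string
   positions, being 0-based, do not need. */
static Charbpos
charbpos_from_count (Charcount count)
{
  assert (count >= 0);
  return (Charbpos) (count + 1);
}
```

Keeping the two types apart makes the off-by-one between buffer positions and string positions visible at every conversion.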
 @node Working With Character and Byte Positions
 @subsection Working With Character and Byte Positions
 @cindex character and byte positions, working with
 @file{buffer.h}, and we don't discuss all of them here, but only the
 most important ones.  Examining the existing code is the best way to
 learn about them.
 @table @code
-@item MAX_EMCHAR_LEN
+@item MAX_ICHAR_LEN
-@cindex MAX_EMCHAR_LEN
+@cindex MAX_ICHAR_LEN
 This preprocessor constant is the maximum number of buffer bytes to
 represent an Emacs character in the variable width internal encoding.
 It is useful when allocating temporary strings to keep a known number of
 characters.  For instance:
 @{
 Charcount cclen;
 ...
 @{
 /* Allocate place for @var{cclen} characters. */
-Bufbyte *buf = (Bufbyte *)alloca (cclen * MAX_EMCHAR_LEN);
+Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN);
 ...
 @end group
 @end example
 If you followed the previous section, you can guess that, logically,
-multiplying a @code{Charcount} value with @code{MAX_EMCHAR_LEN} produces
+multiplying a @code{Charcount} value with @code{MAX_ICHAR_LEN} produces
 a @code{Bytecount} value.
-In the current Mule implementation, @code{MAX_EMCHAR_LEN} equals 4.
+In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4.
 Without Mule, it is 1.
-@item charptr_emchar
+@item itext_ichar
-@itemx set_charptr_emchar
+@itemx set_itext_ichar
-@cindex charptr_emchar
+@cindex itext_ichar
-@cindex set_charptr_emchar
+@cindex set_itext_ichar
-The @code{charptr_emchar} macro takes a @code{Bufbyte} pointer and
+The @code{itext_ichar} macro takes an @code{Ibyte} pointer and
-returns the @code{Emchar} stored at that position.  If it were a
+returns the @code{Ichar} stored at that position.  If it were a
 function, its prototype would be:
 @example
-Emchar charptr_emchar (Bufbyte *p);
+Ichar itext_ichar (Ibyte *p);
 @end example
-@code{set_charptr_emchar} stores an @code{Emchar} to the specified byte
+@code{set_itext_ichar} stores an @code{Ichar} to the specified byte
 position.  It returns the number of bytes stored:
 @example
-Bytecount set_charptr_emchar (Bufbyte *p, Emchar c);
+Bytecount set_itext_ichar (Ibyte *p, Ichar c);
 @end example
-It is important to note that @code{set_charptr_emchar} is safe only for
+It is important to note that @code{set_itext_ichar} is safe only for
 appending a character at the end of a buffer, not for overwriting a
 character in the middle.  This is because the width of characters
-varies, and @code{set_charptr_emchar} cannot resize the string if it
+varies, and @code{set_itext_ichar} cannot resize the string if it
 writes, say, a two-byte character where a single-byte character used to
 reside.
-A typical use of @code{set_charptr_emchar} can be demonstrated by this
+A typical use of @code{set_itext_ichar} can be demonstrated by this
 example, which copies characters from buffer @var{buf} to a temporary
-string of Bufbytes.
+string of Ibytes.
 @example
 @group
 @{
-Bufpos pos;
+Charbpos pos;
 for (pos = beg; pos < end; pos++)
 @{
-Emchar c = BUF_FETCH_CHAR (buf, pos);
+Ichar c = BUF_FETCH_CHAR (buf, pos);
-p += set_charptr_emchar (buf, c);
+p += set_itext_ichar (p, c);
 @}
 @}
 @end group
 @end example
-Note how @code{set_charptr_emchar} is used to store the @code{Emchar}
+Note how @code{set_itext_ichar} is used to store the @code{Ichar}
 and increment the counter, at the same time.
-@item INC_CHARPTR
+@item INC_IBYTEPTR
-@itemx DEC_CHARPTR
+@itemx DEC_IBYTEPTR
-@cindex INC_CHARPTR
+@cindex INC_IBYTEPTR
-@cindex DEC_CHARPTR
+@cindex DEC_IBYTEPTR
-These two macros increment and decrement a @code{Bufbyte} pointer,
+These two macros increment and decrement an @code{Ibyte} pointer,
 respectively.  They will adjust the pointer by the appropriate number of
 bytes according to the byte length of the character stored there.  Both
 macros assume that the memory address is located at the beginning of a
 valid character.
-Without Mule support, @code{INC_CHARPTR (p)} and @code{DEC_CHARPTR (p)}
+Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)}
 simply expand to @code{p++} and @code{p--}, respectively.
 @item bytecount_to_charcount
 @cindex bytecount_to_charcount
 Given a pointer to a text string and a length in bytes, return the
 equivalent length in characters.
 @example
-Charcount bytecount_to_charcount (Bufbyte *p, Bytecount bc);
+Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc);
 @end example
 @item charcount_to_bytecount
 @cindex charcount_to_bytecount
 Given a pointer to a text string and a length in characters, return the
 equivalent length in bytes.
 @example
-Bytecount charcount_to_bytecount (Bufbyte *p, Charcount cc);
+Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc);
 @end example
-@item charptr_n_addr
+@item itext_n_addr
-@cindex charptr_n_addr
+@cindex itext_n_addr
 Return a pointer to the beginning of the character offset @var{cc} (in
 characters) from @var{p}.
 @example
-Bufbyte *charptr_n_addr (Bufbyte *p, Charcount cc);
+Ibyte *itext_n_addr (Ibyte *p, Charcount cc);
 @end example
 @end table
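To make the interplay of these macros concrete, here is a minimal standalone sketch.  It assumes UTF-8 as a stand-in for the variable-width internal encoding (the real Mule encoding is different), and every @code{toy_}-prefixed name is hypothetical.  Only the calling pattern mirrors the macros described above: the setter returns the number of bytes written, and the count conversions walk the text one character at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the XEmacs typedefs. */
typedef unsigned char Ibyte;
typedef int Ichar;
typedef long Bytecount;
typedef long Charcount;

#define TOY_MAX_ICHAR_LEN 4   /* worst-case bytes per character in this toy encoding */

/* Store C at P using UTF-8 (an assumption, not the Mule encoding);
   return the number of bytes written. */
static Bytecount
toy_set_itext_ichar (Ibyte *p, Ichar c)
{
  assert (c >= 0 && c <= 0x10FFFF);
  if (c < 0x80)
    {
      p[0] = (Ibyte) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (Ibyte) (0xC0 | (c >> 6));
      p[1] = (Ibyte) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (Ibyte) (0xE0 | (c >> 12));
      p[1] = (Ibyte) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (Ibyte) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (Ibyte) (0xF0 | (c >> 18));
  p[1] = (Ibyte) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (Ibyte) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (Ibyte) (0x80 | (c & 0x3F));
  return 4;
}

/* Byte length of the character starting at P.  Like INC_IBYTEPTR, this
   assumes P points at the beginning of a valid character. */
static Bytecount
toy_ichar_len (const Ibyte *p)
{
  if (*p < 0x80) return 1;
  if (*p < 0xE0) return 2;
  if (*p < 0xF0) return 3;
  return 4;
}

/* Length in characters of the BC bytes starting at P. */
static Charcount
toy_bytecount_to_charcount (const Ibyte *p, Bytecount bc)
{
  const Ibyte *end = p + bc;
  Charcount cc = 0;
  while (p < end)
    {
      p += toy_ichar_len (p);
      cc++;
    }
  return cc;
}

/* Length in bytes of the first CC characters starting at P. */
static Bytecount
toy_charcount_to_bytecount (const Ibyte *p, Charcount cc)
{
  const Ibyte *start = p;
  while (cc-- > 0)
    p += toy_ichar_len (p);
  return (Bytecount) (p - start);
}

/* Append three characters of different widths to BUF, which must hold at
   least 3 * TOY_MAX_ICHAR_LEN bytes; return the bytes actually used. */
static Bytecount
toy_fill_example (Ibyte *buf)
{
  Ibyte *p = buf;
  p += toy_set_itext_ichar (p, 'a');     /* one byte */
  p += toy_set_itext_ichar (p, 0xE9);    /* two bytes */
  p += toy_set_itext_ichar (p, 0x4E2D);  /* three bytes */
  return (Bytecount) (p - buf);
}
```

Sizing the buffer as the character count times the worst-case width is what lets the appending loop run without bounds checks, at the cost of some unused space.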
 @node Conversion to and from External Data
 @subsection Conversion to and from External Data
 @cindex conversion to and from external data
 @cindex external data, conversion to and from
 When an external function, such as a C library function, returns a
-@code{char} pointer, you should almost never treat it as @code{Bufbyte}.
+@code{char} pointer, you should almost never treat it as @code{Ibyte}.
 This is because these returned strings may contain 8bit characters which
 can be misinterpreted by XEmacs, and cause a crash.  Likewise, when
 exporting a piece of internal text to the outside world, you should
 always convert it to an appropriate external encoding, lest the internal
 stuff (such as the infamous \201 characters) leak out.
 @file{buffer.h}.  There used to be a fixed set of external formats
 supported by these macros, but now any coding system can be used with
 these macros.  The coding system alias mechanism is used to create the
 following logical coding systems, which replace the fixed external
 formats.  The (dontusethis-set-symbol-value-handler) mechanism was
-enhanced to make this possible (more work on that is needed - like
+enhanced to make this possible (more work on that is needed).
-remove the @code{dontusethis-} prefix).
+Example useful coding systems:
 @table @code
 @item Qbinary
 This is the simplest format and is what we use in the absence of a more
 appropriate format.  This converts according to the @code{binary} coding
 @item
 On output, characters 0--255 are converted into bytes 0--255 and other
 characters are converted into `~'.
 @end enumerate
-@item Qfile_name
-Format used for filenames.  This is user-definable via either the
-@code{file-name-coding-system} or @code{pathname-coding-system} (now
-obsolete) variables.
 @item Qnative
 Format used for the external Unix environment---@code{argv[]}, stuff
 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc.
-Currently this is the same as Qfile_name.  The two should be
+This is encoded according to the encoding specified by the current locale.
-distinguished for clarity and possible future separation.
+@item Qfile_name
+Format used for filenames.  This is normally the same as @code{Qnative},
+but the two should be distinguished for clarity and possible future
+separation -- and also because @code{Qfile_name} can be changed using either
+the @code{file-name-coding-system} or @code{pathname-coding-system} (now
+obsolete) variables.
 @item Qctext
-Compound--text format.  This is the standard X11 format used for data
+Compound-text format.  This is the standard X11 format used for data
 stored in properties, selections, and the like.  This is an 8-bit
 no-lock-shift ISO2022 coding system.  This is a real coding system,
-unlike Qfile_name, which is user-definable.
+unlike @code{Qfile_name}, which is user-definable.
+@item Qmswindows_tstr
+Used for external data in all MS Windows functions that are declared to
+accept data of type @code{LPTSTR} or @code{LPCTSTR}.  This maps to either
+@code{Qmswindows_multibyte} (a locale-specific encoding, same as
+@code{Qnative}) or @code{Qmswindows_unicode}, depending on whether
+XEmacs is being run under Windows 9X or Windows NT/2000/XP.
 @end table
 There are two fundamental macros to convert between external and
-internal format.
+internal format, as well as various convenience macros to simplify the
+most common operations.
 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and
 @code{TO_EXTERNAL_FORMAT} converts the other way around.  The arguments
 each of these receives are a source type, a source, a sink type, a sink,
 and a coding system (or a symbol naming a coding system).
 @item @code{C_STRING_ALLOCA, ptr,}
 equivalent to @code{ALLOCA (ptr, len_ignored)} on output.
 @item @code{C_STRING_MALLOC, ptr,}
 equivalent to @code{MALLOC (ptr, len_ignored)} on output
 @item @code{C_STRING, ptr,}
-equivalent to @code{DATA, (ptr, strlen (ptr) + 1)} on input
+equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input
 @item @code{LISP_STRING, string,}
 input or output is a Lisp_Object of type string
 @item @code{LISP_BUFFER, buffer,}
 output is written to @code{(point)} in lisp buffer @var{buffer}
 @item @code{LISP_LSTREAM, lstream,}
 input or output is a Lisp_Object of type lstream
 @item @code{LISP_OPAQUE, object,}
 input or output is a Lisp_Object of type opaque
 @end table
-Often, the data is being converted to a '\0'-byte-terminated string,
+A source type of @code{C_STRING} or a sink type of
-which is the format required by many external system C APIs.  For these
+@code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where
-purposes, a source type of @code{C_STRING} or a sink type of
+the external API is not '\0'-byte-clean -- i.e. it expects strings to be
-@code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate.
+terminated with a null byte.  For external APIs that are in fact
-Otherwise, we should try to keep XEmacs '\0'-byte-clean, which means
+'\0'-byte-clean, we should of course not use these.
-using (ptr, len) pairs.
 The sinks to be specified must be lvalues, unless they are the lisp
 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}.
+There is no problem using the same lvalue for source and sink.
+Garbage collection is inhibited during these conversion operations, so
+it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}.
 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the
 resulting text is stored in a stack-allocated buffer, which is
 automatically freed on returning from the function.  However, the sink
 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed
 Note that it doesn't make sense for @code{LISP_STRING} to be a source
 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}.
 You'll get an assertion failure if you try.
+99% of conversions involve raw data or Lisp strings as both source and
+sink, and usually the output is allocated with @code{alloca()}, or sometimes
+@code{xmalloc()}.  For this reason, convenience macros are defined for
+many types of conversions involving raw data and/or Lisp strings,
+especially when the output is an @code{alloca()}ed string. (When the
+destination is a Lisp string, there are other functions that should be
+used instead -- @code{build_ext_string()} and @code{make_ext_string()},
+for example.) The convenience macros are of two types -- the older kind
+that store the result into a specified variable, and the newer kind that
+return the result.  The newer kind of macros don't exist when the output
+is sized data, because that would have two return values.  NOTE: All
+convenience macros are ultimately defined in terms of
+@code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}.  Thus, any
+comments above about the workings of these macros also apply to all
+convenience macros.
+A typical old-style convenience macro is
+@example
+C_STRING_TO_EXTERNAL (in, out, codesys);
+@end example
+This is equivalent to
+@example
+TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys);
+@end example
+but is easier to write and somewhat clearer, since it clearly identifies
+the arguments without the clutter of having the preprocessor types mixed
+in.
+The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src,
+codesys)}, which @emph{returns} the converted data (still in
+@code{alloca()} space).  This is far more convenient for most
+operations.
 @node General Guidelines for Writing Mule-Aware Code
 @subsection General Guidelines for Writing Mule-Aware Code
 @cindex writing Mule-aware code, general guidelines for
 @cindex Mule-aware code, general guidelines for writing
 @table @emph
 @item Never use @code{char} and @code{char *}.
 In XEmacs, the use of @code{char} and @code{char *} is almost always a
 mistake.  If you want to manipulate an Emacs character from ``C'', use
-@code{Emchar}.  If you want to examine a specific octet in the internal
+@code{Ichar}.  If you want to examine a specific octet in the internal
-format, use @code{Bufbyte}.  If you want a Lisp-visible character, use a
+format, use @code{Ibyte}.  If you want a Lisp-visible character, use a
 @code{Lisp_Object} and @code{make_char}.  If you want a pointer to move
-through the internal text, use @code{Bufbyte *}.  Also note that you
+through the internal text, use @code{Ibyte *}.  Also note that you
-almost certainly do not need @code{Emchar *}.
+almost certainly do not need @code{Ichar *}.  Other typedefs to clarify
+the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary},
-@item Be careful not to confuse @code{Charcount}, @code{Bytecount}, and @code{Bufpos}.
+@code{UChar_Binary}, and @code{CIbyte}.
+@item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}.
 The whole point of using different types is to avoid confusion about the
 use of certain variables.  Lest this effect be nullified, you need to be
 careful about using the right types.
 @item Always convert external data
 It is extremely important to always convert external data, because
-XEmacs can crash if unexpected 8bit sequences are copied to its internal
+XEmacs can crash if unexpected 8-bit sequences are copied to its internal
 buffers literally.
 This means that when a system function, such as @code{readdir}, returns
 a string, you may need to convert it using one of the conversion macros
 described in the previous chapter, before passing it further to Lisp.
 Actually, most of the basic system functions that accept '\0'-terminated
-string arguments, like @code{stat()} and @code{open()}, have been
+string arguments, like @code{stat()} and @code{open()}, have
-@strong{encapsulated} so that they are they @code{always} do internal to
+@strong{encapsulated} equivalents that do the internal to external
-external conversion themselves.  This means you must pass internally
+conversion themselves.  The encapsulated equivalents have a @code{qxe_}
-encoded data, typically the @code{XSTRING_DATA} of a Lisp_String to
+prefix and have string arguments of type @code{Ibyte *}, and you can
-these functions.  This is actually a design bug, since it unexpectedly
+pass internally encoded data to them, often from a Lisp string using
-changes the semantics of the system functions.  A better design would be
+@code{XSTRING_DATA}. (A better design might be to provide versions that
-to provide separate versions of these system functions that accepted
+accept Lisp strings directly.)
-Lisp_Objects which were lisp strings in place of their current
-@code{char *} arguments.
-@example
-int stat_lisp (Lisp_Object path, struct stat *buf); /* Implement me */
-@end example
 Also note that many internal functions, such as @code{make_string},
-accept Bufbytes, which removes the need for them to convert the data
+accept Ibytes, which removes the need for them to convert the data they
-they receive.  This increases efficiency because that way external data
+receive.  This increases efficiency because that way external data needs
-needs to be decoded only once, when it is read.  After that, it is
+to be decoded only once, when it is read.  After that, it is passed
-passed around in internal format.
+around in internal format.
+@item Do all work in internal format
+External-formatted data is completely unpredictable in its format.  It
+may be Unicode (non-ASCII compatible); it may be a modal encoding, in
+which case some occurrences of (e.g.) the slash character may be part of
+two-byte Asian-language characters, and a naive attempt to split apart a
+pathname by slashes will fail; etc.  Internal-format text should be
+converted to external format only at the point where an external API is
+actually called, and the first thing done after receiving
+external-format text from an external API should be to convert it to
+internal text.
 @end table
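The hazard of manipulating external-format text directly can be shown with a small standalone sketch.  It uses UTF-16BE as one example of a non-ASCII-compatible external encoding; the function names and the sample data are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Naive approach: treat external text as ASCII-compatible and count
   every byte equal to '/'. */
static size_t
count_slash_bytes (const unsigned char *data, size_t len)
{
  size_t i, n = 0;
  for (i = 0; i < len; i++)
    if (data[i] == '/')
      n++;
  return n;
}

/* Encoding-aware approach for UTF-16BE: count 16-bit code units equal
   to U+002F.  LEN is in bytes and must be even. */
static size_t
count_slash_utf16be (const unsigned char *data, size_t len)
{
  size_t i, n = 0;
  assert (len % 2 == 0);
  for (i = 0; i < len; i += 2)
    if (data[i] == 0x00 && data[i + 1] == 0x2F)
      n++;
  return n;
}

/* The text "a/" followed by U+4E2F, encoded as UTF-16BE.  The low byte
   of U+4E2F happens to equal the ASCII code for '/'. */
static const unsigned char sample_utf16be[] =
  { 0x00, 0x61, 0x00, 0x2F, 0x4E, 0x2F };
```

On this sample the byte scan reports two slashes although the text contains only one, which is why any splitting of pathnames should happen after conversion to internal format.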
 @node An Example of Mule-Aware Code
 @subsection An Example of Mule-Aware Code
 @cindex code, an example of Mule-aware
 DEFUN ("string", Fstring, 0, MANY, 0, /*
 Concatenate all the argument characters and make the result a string.
 */
 (int nargs, Lisp_Object *args))
 @{
-Bufbyte *storage = alloca_array (Bufbyte, nargs * MAX_EMCHAR_LEN);
+Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN);
-Bufbyte *p = storage;
+Ibyte *p = storage;
 for (; nargs; nargs--, args++)
 @{
 Lisp_Object lisp_char = *args;
 CHECK_CHAR_COERCE_INT (lisp_char);
-p += set_charptr_emchar (p, XCHAR (lisp_char));
+p += set_itext_ichar (p, XCHAR (lisp_char));
 @}
 return make_string (storage, p - storage);
 @}
 @end group
 @end example
 Now we can analyze the source line by line.
 Obviously, the string will contain as many characters as there are arguments to the
-function.  This is why we allocate @code{MAX_EMCHAR_LEN} * @var{nargs}
+function.  This is why we allocate @code{MAX_ICHAR_LEN} * @var{nargs}
 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs}
-@code{Emchar}s to fit in the string.
+@code{Ichar}s to fit in the string.
 Then, the loop checks that each element is a character, converting
 integers in the process.  Like many other functions in XEmacs, this
 function silently accepts integers where characters are expected, for
 historical and compatibility reasons.  Unless you know what you are
 doing, @code{CHECK_CHAR} will also suffice.  @code{XCHAR (lisp_char)}
-extracts the @code{Emchar} from the @code{Lisp_Object}, and
+extracts the @code{Ichar} from the @code{Lisp_Object}, and
-@code{set_charptr_emchar} stores it to storage, increasing @code{p} in
+@code{set_itext_ichar} stores it to storage, increasing @code{p} in
 the process.
 Other instructive examples of correct coding under Mule can be found all
 over the XEmacs code.  For starters, I recommend
 @code{Fnormalize_menu_item_name} in @file{menubar.c}.  After you have
 understood this section of the manual and studied the examples, you can
 proceed writing new Mule-aware code.
+@node Mule-izing Code
+@subsection Mule-izing Code
+A lot of code is written without Mule in mind, and needs to be made
+Mule-correct or "Mule-ized".  There is really no substitute for
+line-by-line analysis when doing this, but the following checklist can
+help:
+@itemize @bullet
+@item
+Check all uses of @code{XSTRING_DATA}.
+@item
+Check all uses of @code{build_string} and @code{make_string}.
+@item
+Check all uses of @code{tolower} and @code{toupper}.
+@item
+Check object print methods.
+@item
+Check for use of functions such as @code{write_c_string},
+@code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}.
+@item
+Check all occurrences of @code{char} and correct to one of the other
+typedefs described above.
+@item
+Check all existing uses of @code{TO_EXTERNAL_FORMAT},
+@code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for
+@samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}).
+@item
+In Windows code, string literals may need to be encapsulated with @code{XETEXT}.
+@end itemize
 @node Techniques for XEmacs Developers
 @section Techniques for XEmacs Developers
 @cindex techniques for XEmacs developers
 @cindex developers, techniques for XEmacs
 @menu
 * Introduction to Buffers::     A buffer holds a block of text such as a file.
 * The Text in a Buffer::        Representation of the text in a buffer.
 * Buffer Lists::                Keeping track of all buffers.
 * Markers and Extents::         Tagging locations within a buffer.
-* Bufbytes and Emchars::        Representation of individual characters.
+* Ibytes and Ichars::           Representation of individual characters.
 * The Buffer Object::           The Lisp object corresponding to a buffer.
 @end menu
 @node Introduction to Buffers
 @section Introduction to Buffers
 For now, we can view a character as some non-negative integer that
 has some shape that defines how it typically appears (e.g. as an
 uppercase A). (The exact way in which a character appears depends on the
 font used to display the character.) The internal type of characters in
-the C code is an @code{Emchar}; this is just an @code{int}, but using a
+the C code is an @code{Ichar}; this is just an @code{int}, but using a
 symbolic type makes the code clearer.
 Between every character in a buffer is a @dfn{buffer position} or
 @dfn{character position}.  We can speak of the character before or after
 a particular buffer position, and when you insert a character at a
 Buffer positions are numbered starting at 1.  This means that
 position 1 is before the first character, and position 0 is not
 valid.  If there are N characters in a buffer, then buffer
 position N+1 is after the last one, and position N+2 is not valid.
-The internal makeup of the Emchar integer varies depending on whether
+The internal makeup of the Ichar integer varies depending on whether
-we have compiled with MULE support.  If not, the Emchar integer is an
+we have compiled with MULE support.  If not, the Ichar integer is an
 8-bit integer with possible values from 0 - 255.  0 - 127 are the
 standard ASCII characters, while 128 - 255 are the characters from the
 ISO-8859-1 character set.  If we have compiled with MULE support, an
-Emchar is a 19-bit integer, with the various bits having meanings
+Ichar is a 19-bit integer, with the various bits having meanings
 according to a complex scheme that will be detailed later.  The
 characters numbered 0 - 255 still have the same meanings as for the
 non-MULE case, though.
 Internally, the text in a buffer is represented in a fairly simple
 the situation is different.  In this case, the space @emph{will} be
 released back to the operating system.  However, this tends to result in a
 noticeable speed penalty.)
 Astute readers may notice that the text in a buffer is represented as
-an array of @emph{bytes}, while (at least in the MULE case) an Emchar is
+an array of @emph{bytes}, while (at least in the MULE case) an Ichar is
 a 19-bit integer, which clearly cannot fit in a byte.  This means (of
 course) that the text in a buffer uses a different representation from
-an Emchar: specifically, the 19-bit Emchar becomes a series of one to
+an Ichar: specifically, the 19-bit Ichar becomes a series of one to
 four bytes.  The conversion between these two representations is complex
 and will be described later.
-In the non-MULE case, everything is very simple: An Emchar
+In the non-MULE case, everything is very simple: An Ichar
 is an 8-bit value, which fits neatly into one byte.
 If we are given a buffer position and want to retrieve the
 character at that position, we need to follow these steps:
 position that is @dfn{at} the gap, we always use the memory position at
 the @emph{beginning}, not at the end, of the gap.
 @item
 Fetch the appropriate bytes at the determined memory position.
 @item
-Convert these bytes into an Emchar.
+Convert these bytes into an Ichar.
 @end enumerate
 In the non-Mule case, (3) and (4) boil down to a simple one-byte
 memory access.
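
As a rough illustration of that one-byte access, here is a minimal sketch in C.  It assumes a toy buffer layout: the struct @code{toy_buffer}, its fields, and the function @code{toy_char_after} are hypothetical names invented for this example, not the actual XEmacs macros or data structures.  It shows only how a 1-based buffer position is mapped past the gap when each character occupies exactly one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-byte-per-character gap buffer (non-MULE case only).
   None of these names come from the XEmacs sources. */
struct toy_buffer {
    const unsigned char *data; /* text bytes with a gap somewhere inside */
    size_t gap_start;          /* number of text bytes stored before the gap */
    size_t gap_size;           /* number of unused bytes in the gap */
    size_t text_len;           /* number of text bytes, excluding the gap */
};

/* Return the character following 1-based buffer position pos,
   or -1 if no character follows that position. */
int toy_char_after(const struct toy_buffer *buf, size_t pos)
{
    size_t offset;

    assert(buf != NULL);
    if (pos < 1 || pos > buf->text_len)
        return -1;

    /* Positions start at 1, array offsets at 0. */
    offset = pos - 1;

    /* Text stored after the gap is shifted by the size of the gap. */
    if (offset >= buf->gap_start)
        offset += buf->gap_size;

    /* With one byte per character, the byte itself is the character. */
    return buf->data[offset];
}
```

In a MULE build the last step would instead have to decode one to four bytes, which this sketch does not attempt.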
 Note that we have defined three types of positions in a buffer:
 @enumerate
 @item
-@dfn{buffer positions} or @dfn{character positions}, typedef @code{Bufpos}
+@dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos}
 @item
-@dfn{byte indices}, typedef @code{Bytind}
+@dfn{byte indices}, typedef @code{Bytebpos}
 @item
-@dfn{memory indices}, typedef @code{Memind}
+@dfn{memory indices}, typedef @code{Membpos}
 @end enumerate
 All three typedefs are just @code{int}s, but defining them this way makes
 things a lot clearer.
 Most code works with buffer positions.  In particular, all Lisp code
 that refers to text in a buffer uses buffer positions.  Lisp code does
 not know that byte indices or memory indices exist.
 Finally, we have a typedef for the bytes in a buffer.  This is a
-@code{Bufbyte}, which is an unsigned char.  Referring to them as
+@code{Ibyte}, which is an unsigned char.  Referring to them as
-Bufbytes underscores the fact that we are working with a string of bytes
+Ibytes underscores the fact that we are working with a string of bytes
 in the internal Emacs buffer representation rather than in one of a
 number of possible alternative representations (e.g. EUC-encoded text).
 @node Buffer Lists
 The important thing here is that markers and extents simply contain
 buffer positions in them as integers, and every time text is inserted or
 deleted, these positions must be updated.  In order to minimize the
 amount of shuffling that needs to be done, the positions in markers and
-extents (there's one per marker, two per extent) are stored in Meminds.
+extents (there's one per marker, two per extent) are stored in Membpos's.
 This means that they only need to be moved when the text is physically
 moved in memory; since the gap structure tries to minimize this, it also
 minimizes the number of marker and extent indices that need to be
 adjusted.  Look in @file{insdel.c} for the details of how this works.
 is no way to determine what markers are in a buffer if you are just
 given the buffer.  Extents remain in a buffer until they are detached
 (which could happen as a result of text being deleted) or the buffer is
 deleted, and primitives do exist to enumerate the extents in a buffer.
-@node Bufbytes and Emchars
+@node Ibytes and Ichars
-@section Bufbytes and Emchars
+@section Ibytes and Ichars
-@cindex Bufbytes and Emchars
+@cindex Ibytes and Ichars
-@cindex Emchars, Bufbytes and
+@cindex Ichars, Ibytes and
 Not yet documented.
 @node The Buffer Object
 @section The Buffer Object
 @cindex character sets and encodings, Mule
 @cindex encodings, Mule character sets and
 Recall that there are two primary ways that text is represented in
 XEmacs.  The @dfn{buffer} representation sees the text as a series of
-bytes (Bufbytes), with a variable number of bytes used per character.
+bytes (Ibytes), with a variable number of bytes used per character.
 The @dfn{character} representation sees the text as a series of integers
-(Emchars), one per character.  The character representation is a cleaner
+(Ichars), one per character.  The character representation is a cleaner
 representation from a theoretical standpoint, and is thus used in many
 cases when lots of manipulations on a string need to be done.  However,
 the buffer representation is the standard representation used in both
 Lisp strings and buffers, and because of this, it is the ``default''
 representation that text comes in.  The reason for using this
 @deftypefunx int Lstream_fgetc (Lstream *@var{stream})
 @deftypefunx void Lstream_fungetc (Lstream *@var{stream}, int @var{c})
 Function equivalents of the above macros.
 @end deftypefun
-@deftypefun ssize_t Lstream_read (Lstream *@var{stream}, void *@var{data}, size_t @var{size})
+@deftypefun Bytecount Lstream_read (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size})
 Read @var{size} bytes of @var{data} from the stream.  Return the number
 of bytes read.  0 means EOF. -1 means an error occurred and no bytes
 were read.
 @end deftypefun
-@deftypefun ssize_t Lstream_write (Lstream *@var{stream}, void *@var{data}, size_t @var{size})
+@deftypefun Bytecount Lstream_write (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size})
 Write @var{size} bytes of @var{data} to the stream.  Return the number
 of bytes written.  -1 means an error occurred and no bytes were written.
 @end deftypefun
-@deftypefun void Lstream_unread (Lstream *@var{stream}, void *@var{data}, size_t @var{size})
+@deftypefun void Lstream_unread (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size})
 Push back @var{size} bytes of @var{data} onto the input queue.  The next
 call to @code{Lstream_read()} with the same size will read the same
 bytes back.  Note that this will be the case even if there is other
 pending unread data.
 @end deftypefun
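
The push-back behavior described for @code{Lstream_unread()} can be pictured with a small stand-alone sketch.  Everything here is an assumption made for illustration: @code{toy_stream}, @code{toy_unread} and @code{toy_read} are hypothetical names, and the fixed-size pending buffer is not how the real Lstream code stores unread data.  The point is only that the most recently unread bytes come back first, ahead of any older pending data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PENDING_MAX 64

/* Hypothetical stream with a push-back area; not the Lstream layout. */
struct toy_stream {
    const unsigned char *src;  /* underlying data source */
    size_t src_len;            /* total bytes available in src */
    size_t src_pos;            /* next unread offset in src */
    unsigned char pending[TOY_PENDING_MAX]; /* pushed-back bytes,
                                               next-to-read first */
    size_t pending_len;        /* number of bytes in pending */
};

/* Push size bytes back so that the next read returns them first.
   Returns 0 on success, -1 if the pending area is full. */
int toy_unread(struct toy_stream *s, const void *data, size_t size)
{
    assert(s != NULL && data != NULL);
    if (size > TOY_PENDING_MAX - s->pending_len)
        return -1;
    /* Older pending data moves back; new data goes in front of it. */
    memmove(s->pending + size, s->pending, s->pending_len);
    memcpy(s->pending, data, size);
    s->pending_len += size;
    return 0;
}

/* Read up to size bytes into data.  Returns the number of bytes read;
   0 means there is nothing left. */
size_t toy_read(struct toy_stream *s, void *data, size_t size)
{
    unsigned char *out = data;
    size_t from_pending, from_src;

    assert(s != NULL && data != NULL);

    /* Drain pushed-back bytes first. */
    from_pending = size < s->pending_len ? size : s->pending_len;
    memcpy(out, s->pending, from_pending);
    memmove(s->pending, s->pending + from_pending,
            s->pending_len - from_pending);
    s->pending_len -= from_pending;

    /* Then fall through to the underlying source. */
    from_src = size - from_pending;
    if (from_src > s->src_len - s->src_pos)
        from_src = s->src_len - s->src_pos;
    memcpy(out + from_pending, s->src + s->src_pos, from_src);
    s->src_pos += from_src;

    return from_pending + from_src;
}
```

Unreading three bytes and then three more, and reading three back, returns the second group unchanged; the first group follows on the next read.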
 @node Lstream Methods
 @section Lstream Methods
 @cindex lstream methods
-@deftypefn {Lstream Method} ssize_t reader (Lstream *@var{stream}, unsigned char *@var{data}, size_t @var{size})
+@deftypefn {Lstream Method} Bytecount reader (Lstream *@var{stream}, unsigned char *@var{data}, Bytecount @var{size})
 Read some data from the stream's end and store it into @var{data}, which
 can hold @var{size} bytes.  Return the number of bytes read.  A return
 value of 0 means no bytes can be read at this time.  This may be because
 of an EOF, or because there is a granularity greater than one byte that
 the stream imposes on the returned data, and @var{size} is less than
 calls @code{Lstream_read()} with a very small size.
 This function can be @code{NULL} if the stream is output-only.
 @end deftypefn
-@deftypefn {Lstream Method} ssize_t writer (Lstream *@var{stream}, const unsigned char *@var{data}, size_t @var{size})
+@deftypefn {Lstream Method} Bytecount writer (Lstream *@var{stream}, const unsigned char *@var{data}, Bytecount @var{size})
 Send some data to the stream's end.  Data to be sent is in @var{data}
 and is @var{size} bytes.  Return the number of bytes sent.  This
 function can send and return fewer bytes than is passed in; in that
 case, the function will just be called again until there is no data left
 or 0 is returned.  A return value of 0 means that no more data can be
 Similarly, a string may or may not have an extent_info structure.
 (Generally it won't if there haven't been any extents added to the
 string.) So use the @code{_force} version if you need the extent_info
 structure to be there.
-A list of extents is maintained as a double gap array: one gap array
+A list of extents is maintained as a double gap array: One gap array
 is ordered by start index (the @dfn{display order}) and the other is
 ordered by end index (the @dfn{e-order}).  Note that positions in an
 extent list should logically be conceived of as referring @emph{to} a
 particular extent (as is the norm in programs) rather than sitting
 between two extents.  Note also that callers of these functions should
 not be aware of the fact that the extent list is implemented as an
 array, except for the fact that positions are integers (this should be
 generalized to handle integers and linked lists equally well).
+A gap array is the same structure used by buffer text: an array of
+elements with a ``gap'' somewhere in the middle.  Insertion and deletion
+happen by moving the gap to the insertion/deletion point, and then
+expanding or contracting the gap as necessary.  Gap arrays have a number of
+useful properties:
+@enumerate
+@item
+They are space efficient, as there is no need for next/previous pointers.
+@item
+If the items in them are sorted, locating an item is fast -- @math{O(log N)}.
+@item
+Insertion and deletion are very fast (constant time, essentially) if the
+gap is nearby (which favors localized operations, as will usually be the
+case).  Even if not, it requires only a block move of memory, which is
+generally a highly optimized operation on modern processors.
+@item
+Code to manipulate them is relatively simple to write.
+@end enumerate
+An alternative would be balanced binary trees, which have guaranteed
+@math{O(log N)} time for all operations (although the constant factors
+are not as good, and repeated localized operations will be slower than
+for a gap array).  Such code is quite tricky to write, however.
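
To make the gap-array idea concrete, here is a minimal sketch in C of an integer gap array supporting insertion and indexed lookup.  The names @code{toy_gap_array}, @code{toy_move_gap}, @code{toy_insert} and @code{toy_get} are hypothetical and chosen for this example only; the growth policy (doubling, starting at 8 slots) is likewise an assumption, not the policy used by the XEmacs sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical gap array of ints; not the XEmacs implementation. */
struct toy_gap_array {
    int *items;        /* storage: live items with a gap in the middle */
    size_t gap_start;  /* number of live items stored before the gap */
    size_t gap_size;   /* number of free slots in the gap */
    size_t count;      /* number of live items */
};

/* Move the gap so that it begins at logical index pos.
   This is a single block move of the items between old and new gap. */
static void toy_move_gap(struct toy_gap_array *ga, size_t pos)
{
    if (pos < ga->gap_start) {
        /* Slide items [pos, gap_start) up to the far side of the gap. */
        memmove(ga->items + pos + ga->gap_size, ga->items + pos,
                (ga->gap_start - pos) * sizeof(int));
    } else if (pos > ga->gap_start) {
        /* Slide items just after the gap down to the near side. */
        memmove(ga->items + ga->gap_start,
                ga->items + ga->gap_start + ga->gap_size,
                (pos - ga->gap_start) * sizeof(int));
    }
    ga->gap_start = pos;
}

/* Insert value at logical index pos.  Returns 0, or -1 if out of memory. */
int toy_insert(struct toy_gap_array *ga, size_t pos, int value)
{
    assert(ga != NULL && pos <= ga->count);

    if (ga->gap_size == 0) {
        /* No room left: put the (empty) gap at the end and grow it. */
        size_t new_gap = ga->count ? ga->count : 8;
        int *p;

        toy_move_gap(ga, ga->count);
        p = realloc(ga->items, (ga->count + new_gap) * sizeof(int));
        if (p == NULL)
            return -1;
        ga->items = p;
        ga->gap_size = new_gap;
    }

    /* Cheap when pos is close to the current gap. */
    toy_move_gap(ga, pos);
    ga->items[ga->gap_start++] = value;
    ga->gap_size--;
    ga->count++;
    return 0;
}

/* Return the item at logical index i, skipping over the gap. */
int toy_get(const struct toy_gap_array *ga, size_t i)
{
    assert(ga != NULL && i < ga->count);
    return ga->items[i < ga->gap_start ? i : i + ga->gap_size];
}
```

Repeated insertions at or near the same index leave the gap in place, so each one costs a single store; only an insertion far from the gap pays for the block move.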
 @node Zero-Length Extents
 @section Zero-Length Extents
 @cindex zero-length extents
 @cindex extents, zero-length
 This is the analog of Theorem 1, and applies because the e-order
 sorts by increasing ending index.
 Therefore, @math{F} can be found in the same amount of time as
 operation (1), i.e. the time that it takes to locate where an extent
-would go if inserted into the e-order list.
+would go if inserted into the e-order list.  This is @math{O(log N)},
+since we are using gap arrays to manage extents.
-If the lists were stored as balanced binary trees, then operation (1)
-would take logarithmic time, which is usually quite fast.  However,
-currently they're stored as simple doubly-linked lists, and instead we
-do some caching to try to speed things up.
 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents
-(ordered in the display order) that overlap an index @math{I}, together
+(ordered in display order and e-order, just like for normal extent
-with the SOE's @dfn{previous} extent, which is an extent that precedes
+lists) that overlap an index @math{I}.
-@math{I} in the e-order. (Hopefully there will not be very many extents
-between @math{I} and the previous extent.)
 Now:
 Let @math{I} be an index, let @math{S} be the stack of extents on
-@math{I}, let @math{F} be the first extent in @math{S}, and let @math{P}
+@math{I} and let @math{F} be the first extent in @math{S}.
-be @math{S}'s previous extent.
 Theorem 3: The first extent in @math{S} is the first extent that overlaps
 any range @math{[I, J]}.
 Proof: Any extent that overlaps @math{[I, J]} but does not include
