comparison man/internals/internals.texi @ 1263:bada4b0bce3a

[xemacs-hg @ 2003-02-06 14:37:51 by stephent] nits <87fzr1o4s3.fsf@tleepslib.sk.tsukuba.ac.jp>
author stephent
date Thu, 06 Feb 2003 14:37:56 +0000
parents 465bd3c7d932
children 1b0339b048ce
1262:807c72f959fe 1263:bada4b0bce3a
265 265
266 * Introduction to Buffers:: A buffer holds a block of text such as a file. 266 * Introduction to Buffers:: A buffer holds a block of text such as a file.
267 * The Text in a Buffer:: Representation of the text in a buffer. 267 * The Text in a Buffer:: Representation of the text in a buffer.
268 * Buffer Lists:: Keeping track of all buffers. 268 * Buffer Lists:: Keeping track of all buffers.
269 * Markers and Extents:: Tagging locations within a buffer. 269 * Markers and Extents:: Tagging locations within a buffer.
270 * Ibytes and Ichars:: Representation of individual characters. 270 * Ibytes and Ichars:: Representation of individual characters.
271 * The Buffer Object:: The Lisp object corresponding to a buffer. 271 * The Buffer Object:: The Lisp object corresponding to a buffer.
272 272
273 MULE Character Sets and Encodings 273 MULE Character Sets and Encodings
274 274
275 * Character Sets:: 275 * Character Sets::
2766 @cindex Ichar 2766 @cindex Ichar
2767 An @code{Ichar} holds a single Emacs character. 2767 An @code{Ichar} holds a single Emacs character.
2768 2768
2769 Obviously, the equality between characters and bytes is lost in the Mule 2769 Obviously, the equality between characters and bytes is lost in the Mule
2770 world. Characters can be represented by one or more bytes in the 2770 world. Characters can be represented by one or more bytes in the
2771 buffer, and @code{Ichar} is the C type large enough to hold any 2771 buffer, and @code{Ichar} is a C type large enough to hold any
2772 character. 2772 character.
2773 2773
2774 Without Mule support, an @code{Ichar} is equivalent to an 2774 Without Mule support, an @code{Ichar} is equivalent to an
2775 @code{unsigned char}. 2775 @code{unsigned char}.
2776 2776
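The hunk above says that a character may occupy one or more bytes in the buffer, while an @code{Ichar} is wide enough to hold any single character. The following is a minimal sketch of that distinction, not XEmacs code. It uses UTF-8 as a stand-in for a variable-width encoding (the internal Mule encoding is different), and the names `demo_ibyte`, `demo_ichar`, `demo_seq_len`, `demo_char_count` and `demo_decode` are hypothetical. Well-formed input is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the XEmacs definitions. */
typedef unsigned char demo_ibyte;  /* one byte of variable-width text   */
typedef int demo_ichar;            /* wide enough for any one character */

/* Length in bytes of the UTF-8 sequence introduced by lead byte b.
   Malformed lead bytes count as length 1 so that scanning always advances. */
size_t demo_seq_len(demo_ibyte b)
{
  if (b < 0x80) return 1;            /* ASCII: one byte, same value */
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 1;
}

/* Count characters (not bytes) in a NUL-terminated string. */
size_t demo_char_count(const demo_ibyte *p)
{
  size_t n = 0;
  assert(p != NULL);
  while (*p)
    {
      size_t len = demo_seq_len(*p);
      size_t i;
      /* Stop early at the terminator if the last sequence is truncated. */
      for (i = 1; i < len && p[i]; i++)
        ;
      p += i;
      n++;
    }
  return n;
}

/* Decode the single character starting at p (well-formed input assumed). */
demo_ichar demo_decode(const demo_ibyte *p)
{
  size_t len, i;
  demo_ichar c;
  assert(p != NULL);
  len = demo_seq_len(*p);
  if (len == 1)
    return *p;
  c = *p & (0xFF >> (len + 1));      /* payload bits of the lead byte */
  for (i = 1; i < len; i++)
    c = (c << 6) | (p[i] & 0x3F);    /* six payload bits per trailing byte */
  return c;
}
```

With this sketch, a four-byte string holding `a`, a two-byte character and `z` has a character count of three, which is why byte counts and character counts cannot be used interchangeably once multibyte text is involved.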
2781 2781
2782 XEmacs does not work with the same character formats all the time; when 2782 XEmacs does not work with the same character formats all the time; when
2783 reading characters from the outside, it decodes them to an internal 2783 reading characters from the outside, it decodes them to an internal
2784 format, and likewise encodes them when writing. @code{Ibyte} (in fact 2784 format, and likewise encodes them when writing. @code{Ibyte} (in fact
2785 @code{unsigned char}) is the basic unit of XEmacs internal buffers and 2785 @code{unsigned char}) is the basic unit of XEmacs internal buffers and
2786 strings format. A @code{Ibyte *} is the type that points at text 2786 strings format. An @code{Ibyte *} is the type that points at text
2787 encoded in the variable-width internal encoding. 2787 encoded in the variable-width internal encoding.
2788 2788
2789 One character can correspond to one or more @code{Ibyte}s. In the 2789 One character can correspond to one or more @code{Ibyte}s. In the
2790 current Mule implementation, an ASCII character is represented by the 2790 current Mule implementation, an ASCII character is represented by the
2791 same @code{Ibyte}, and other characters are represented by a sequence 2791 same @code{Ibyte}, and other characters are represented by a sequence
2985 2985
2986 The interface to conversion between the internal and external 2986 The interface to conversion between the internal and external
2987 representations of text are the numerous conversion macros defined in 2987 representations of text are the numerous conversion macros defined in
2988 @file{buffer.h}. There used to be a fixed set of external formats 2988 @file{buffer.h}. There used to be a fixed set of external formats
2989 supported by these macros, but now any coding system can be used with 2989 supported by these macros, but now any coding system can be used with
2990 these macros. The coding system alias mechanism is used to create the 2990 them. The coding system alias mechanism is used to create the
2991 following logical coding systems, which replace the fixed external 2991 following logical coding systems, which replace the fixed external
2992 formats. The (dontusethis-set-symbol-value-handler) mechanism was 2992 formats. The (dontusethis-set-symbol-value-handler) mechanism was
2993 enhanced to make this possible (more work on that is needed). 2993 enhanced to make this possible (more work on that is needed).
2994 2994
2995 Example useful coding systems: 2995 Often useful coding systems:
2996 2996
2997 @table @code 2997 @table @code
2998 @item Qbinary 2998 @item Qbinary
2999 This is the simplest format and is what we use in the absence of a more 2999 This is the simplest format and is what we use in the absence of a more
3000 appropriate format. This converts according to the @code{binary} coding 3000 appropriate format. This converts according to the @code{binary} coding
3038 accept data of type @code{LPTSTR} or @code{LPCSTR}. This maps to either 3038 accept data of type @code{LPTSTR} or @code{LPCSTR}. This maps to either
3039 @code{Qmswindows_multibyte} (a locale-specific encoding, same as 3039 @code{Qmswindows_multibyte} (a locale-specific encoding, same as
3040 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether 3040 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether
3041 XEmacs is being run under Windows 9X or Windows NT/2000/XP. 3041 XEmacs is being run under Windows 9X or Windows NT/2000/XP.
3042 @end table 3042 @end table
3043
3044 Many other coding systems are provided by default.
3043 3045
3044 There are two fundamental macros to convert between external and 3046 There are two fundamental macros to convert between external and
3045 internal format, as well as various convenience macros to simplify the 3047 internal format, as well as various convenience macros to simplify the
3046 most common operations. 3048 most common operations.
3047 3049
3194 It is extremely important to always convert external data, because 3196 It is extremely important to always convert external data, because
3195 XEmacs can crash if unexpected 8-bit sequences are copied to its internal 3197 XEmacs can crash if unexpected 8-bit sequences are copied to its internal
3196 buffers literally. 3198 buffers literally.
3197 3199
3198 This means that when a system function, such as @code{readdir}, returns 3200 This means that when a system function, such as @code{readdir}, returns
3199 a string, you may need to convert it using one of the conversion macros 3201 a string, you normally need to convert it using one of the conversion macros
3200 described in the previous chapter, before passing it further to Lisp. 3202 described in the previous chapter, before passing it further to Lisp.
3201 3203
3202 Actually, most of the basic system functions that accept '\0'-terminated 3204 Actually, most of the basic system functions that accept '\0'-terminated
3203 string arguments, like @code{stat()} and @code{open()}, have 3205 string arguments, like @code{stat()} and @code{open()}, have
3204 @strong{encapsulated} equivalents that do the internal to external 3206 @strong{encapsulated} equivalents that do the internal to external
3214 to be decoded only once, when it is read. After that, it is passed 3216 to be decoded only once, when it is read. After that, it is passed
3215 around in internal format. 3217 around in internal format.
3216 3218
3217 @item Do all work in internal format 3219 @item Do all work in internal format
3218 External-formatted data is completely unpredictable in its format. It 3220 External-formatted data is completely unpredictable in its format. It
3219 may be Unicode (non-ASCII compatible); it may be a modal encoding, in 3221 may be fixed-width Unicode (not even ASCII compatible); it may be a
3222 modal encoding, in
3220 which case some occurrences of (e.g.) the slash character may be part of 3223 which case some occurrences of (e.g.) the slash character may be part of
3221 two-byte Asian-language characters, and a naive attempt to split apart a 3224 two-byte Asian-language characters, and a naive attempt to split apart a
3222 pathname by slashes will fail; etc. Internal-format text should be 3225 pathname by slashes will fail; etc. Internal-format text should be
3223 converted to external format only at the point where an external API is 3226 converted to external format only at the point where an external API is
3224 actually called, and the first thing done after receiving 3227 actually called, and the first thing done after receiving
8111 @menu 8114 @menu
8112 * Introduction to Buffers:: A buffer holds a block of text such as a file. 8115 * Introduction to Buffers:: A buffer holds a block of text such as a file.
8113 * The Text in a Buffer:: Representation of the text in a buffer. 8116 * The Text in a Buffer:: Representation of the text in a buffer.
8114 * Buffer Lists:: Keeping track of all buffers. 8117 * Buffer Lists:: Keeping track of all buffers.
8115 * Markers and Extents:: Tagging locations within a buffer. 8118 * Markers and Extents:: Tagging locations within a buffer.
8116 * Ibytes and Ichars:: Representation of individual characters. 8119 * Ibytes and Ichars:: Representation of individual characters.
8117 * The Buffer Object:: The Lisp object corresponding to a buffer. 8120 * The Buffer Object:: The Lisp object corresponding to a buffer.
8118 @end menu 8121 @end menu
8119 8122
8120 @node Introduction to Buffers 8123 @node Introduction to Buffers
8121 @section Introduction to Buffers 8124 @section Introduction to Buffers
8196 @dfn{character position}. We can speak of the character before or after 8199 @dfn{character position}. We can speak of the character before or after
8197 a particular buffer position, and when you insert a character at a 8200 a particular buffer position, and when you insert a character at a
8198 particular position, all characters after that position end up at new 8201 particular position, all characters after that position end up at new
8199 positions. When we speak of the character @dfn{at} a position, we 8202 positions. When we speak of the character @dfn{at} a position, we
8200 really mean the character after the position. (This schizophrenia 8203 really mean the character after the position. (This schizophrenia
8201 between a buffer position being ``between'' a character and ``on'' a 8204 between a buffer position being ``between'' two characters and ``on'' a
8202 character is rampant in Emacs.) 8205 character is rampant in Emacs.)
8203 8206
8204 Buffer positions are numbered starting at 1. This means that 8207 Buffer positions are numbered starting at 1. This means that
8205 position 1 is before the first character, and position 0 is not 8208 position 1 is before the first character, and position 0 is not
8206 valid. If there are N characters in a buffer, then buffer 8209 valid. If there are N characters in a buffer, then buffer
9794 Similarly, a string may or may not have an extent_info structure. 9797 Similarly, a string may or may not have an extent_info structure.
9795 (Generally it won't if there haven't been any extents added to the 9798 (Generally it won't if there haven't been any extents added to the
9796 string.) So use the @code{_force} version if you need the extent_info 9799 string.) So use the @code{_force} version if you need the extent_info
9797 structure to be there. 9800 structure to be there.
9798 9801
9799 A list of extents is maintained as a double gap array: One gap array 9802 A list of extents is maintained as a double gap array. One gap array
9800 is ordered by start index (the @dfn{display order}) and the other is 9803 is ordered by start index (the @dfn{display order}) and the other is
9801 ordered by end index (the @dfn{e-order}). Note that positions in an 9804 ordered by end index (the @dfn{e-order}). Note that positions in an
9802 extent list should logically be conceived of as referring @emph{to} a 9805 extent list should logically be conceived of as referring @emph{to} a
9803 particular extent (as is the norm in programs) rather than sitting 9806 particular extent (as is the norm in programs) rather than sitting
9804 between two extents. Note also that callers of these functions should 9807 between two extents. Note also that callers of these functions should
9827 9830
9828 @item 9831 @item
9829 Code to manipulate them is relatively simple to write. 9832 Code to manipulate them is relatively simple to write.
9830 @end enumerate 9833 @end enumerate
9831 9834
9832 An alternative would be a balanced binary trees, which have guaranteed 9835 An alternative would be balanced binary trees, which have guaranteed
9833 @math{O(log N)} time for all operations (although the constant factors 9836 @math{O(log N)} time for all operations (although the constant factors
9834 are not as good, and repeated localized operations will be slower than 9837 are not as good, and repeated localized operations will be slower than
9835 for a gap array). Such code is quite tricky to write, however. 9838 for a gap array). Such code is quite tricky to write, however.
9836 9839
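The hunk above weighs gap arrays against balanced binary trees. As a rough illustration of why repeated localized operations are cheap in a gap array, here is a minimal sketch of a single gap array of `int` values. It is not the XEmacs implementation: the type `demo_gap_array` and the `demo_ga_*` functions are hypothetical, and a real extent list would keep two such arrays (one in display order, one in e-order) holding extents rather than integers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration; NOT the XEmacs gap array type. */
typedef struct
{
  int *data;       /* live elements, with one run of unused slots (the gap) */
  size_t gap;      /* index at which the gap starts                         */
  size_t gapsize;  /* number of unused slots in the gap                     */
  size_t numels;   /* number of live elements                               */
} demo_gap_array;

void demo_ga_init(demo_gap_array *ga, size_t capacity)
{
  ga->data = malloc(capacity * sizeof(int));
  assert(ga->data != NULL || capacity == 0);
  ga->gap = 0;
  ga->gapsize = capacity;
  ga->numels = 0;
}

/* Slide the gap so that it starts at logical position pos.  The cost is
   proportional to the distance moved, so nearby edits are cheap. */
static void demo_ga_move_gap(demo_gap_array *ga, size_t pos)
{
  if (pos < ga->gap)
    memmove(ga->data + pos + ga->gapsize, ga->data + pos,
            (ga->gap - pos) * sizeof(int));
  else if (pos > ga->gap)
    memmove(ga->data + ga->gap, ga->data + ga->gap + ga->gapsize,
            (pos - ga->gap) * sizeof(int));
  ga->gap = pos;
}

void demo_ga_insert(demo_gap_array *ga, size_t pos, int value)
{
  assert(pos <= ga->numels);
  if (ga->gapsize == 0)
    {
      /* Out of room: park the (empty) gap at the end and grow the storage. */
      size_t newcap = ga->numels ? ga->numels * 2 : 8;
      int *p;
      demo_ga_move_gap(ga, ga->numels);
      p = realloc(ga->data, newcap * sizeof(int));
      assert(p != NULL);
      ga->data = p;
      ga->gapsize = newcap - ga->numels;
    }
  demo_ga_move_gap(ga, pos);
  ga->data[ga->gap++] = value;   /* fill the first slot of the gap */
  ga->gapsize--;
  ga->numels++;
}

void demo_ga_delete(demo_gap_array *ga, size_t pos)
{
  assert(pos < ga->numels);
  demo_ga_move_gap(ga, pos);
  ga->gapsize++;                 /* the gap swallows the element after it */
  ga->numels--;
}

int demo_ga_get(const demo_gap_array *ga, size_t pos)
{
  assert(pos < ga->numels);
  return ga->data[pos < ga->gap ? pos : pos + ga->gapsize];
}
```

Inserting or deleting at the position where the gap already sits moves no elements at all, which is the localized case the text refers to. An edit far from the gap costs time proportional to the distance, and that is the worst case a balanced tree would avoid.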
9837 @node Zero-Length Extents 9840 @node Zero-Length Extents