comparison man/internals/internals.texi @ 1261:465bd3c7d932
[xemacs-hg @ 2003-02-06 06:35:47 by ben]
various bug fixes
mule/cyril-util.el: Fix compile warning.
loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set stack-trace-on-error, load-always-display-messages so we
get better debug results.
update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure.
simple.el: Omit M-S-home/end from motion keys.
update-elc.el: Overhaul:
-- allow list of "early-compile" files to be specified, not hardcoded
-- fix autoload checking to include all .el files, not just dumped ones
-- be smarter about regenerating autoloads, so we don't need to use
loadup-el if not necessary
-- use standard methods for loading/not loading auto-autoloads.el
(maybe fixes "Already loaded" error?)
-- rename misleading NOBYTECOMPILE flag file.
window-xemacs.el: Fix bug in default param.
window-xemacs.el: Fix compile warnings.
lwlib-Xm.c: Fix compile warning.
lispref/mule.texi: Lots of Mule rewriting.
internals/internals.texi: Major fixup. Correct for new names of Bytebpos, Ichar, etc. and
lots of Mule rewriting.
config.inc.samp: Various fixups.
Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE.
esd.c: Warning fixes.
fns.c: Eliminate bogus require-prints-loading-message; use already
existent load-always-display-messages instead. Make sure `load'
knows we are coming from `require'.
lread.c: Turn on `load-warn-when-source-newer' by default. Change loading
message to indicate when we are `require'ing. Eliminate
purify_flag hacks to display more messages; instead, loadup and
friends specify this explicitly with
`load-always-display-messages'. Add spaces when batch to clearly
indicate recursive loading. Fassoc() does not GC so no need to
gcpro.
gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar items due to lack of GCPROing
of callbacks in lwlib structures.
eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when noninteractive, or
backtraces get all screwed up as some values are printed through
the stream console and some aren't. Export
canonicalize_printcharfun() and use in Fbacktrace().
author | ben |
---|---|
date | Thu, 06 Feb 2003 06:36:17 +0000 |
parents | c1553814932e |
children | bada4b0bce3a |
1260:278c9cd3435e | 1261:465bd3c7d932 |
---|---|
265 | 265 |
266 * Introduction to Buffers:: A buffer holds a block of text such as a file. | 266 * Introduction to Buffers:: A buffer holds a block of text such as a file. |
267 * The Text in a Buffer:: Representation of the text in a buffer. | 267 * The Text in a Buffer:: Representation of the text in a buffer. |
268 * Buffer Lists:: Keeping track of all buffers. | 268 * Buffer Lists:: Keeping track of all buffers. |
269 * Markers and Extents:: Tagging locations within a buffer. | 269 * Markers and Extents:: Tagging locations within a buffer. |
270 * Bufbytes and Emchars:: Representation of individual characters. | 270 * Ibytes and Ichars:: Representation of individual characters. |
271 * The Buffer Object:: The Lisp object corresponding to a buffer. | 271 * The Buffer Object:: The Lisp object corresponding to a buffer. |
272 | 272 |
273 MULE Character Sets and Encodings | 273 MULE Character Sets and Encodings |
274 | 274 |
275 * Character Sets:: | 275 * Character Sets:: |
2746 * Character-Related Data Types:: | 2746 * Character-Related Data Types:: |
2747 * Working With Character and Byte Positions:: | 2747 * Working With Character and Byte Positions:: |
2748 * Conversion to and from External Data:: | 2748 * Conversion to and from External Data:: |
2749 * General Guidelines for Writing Mule-Aware Code:: | 2749 * General Guidelines for Writing Mule-Aware Code:: |
2750 * An Example of Mule-Aware Code:: | 2750 * An Example of Mule-Aware Code:: |
2751 * Mule-izing Code:: | |
2751 @end menu | 2752 @end menu |
2752 | 2753 |
2753 @node Character-Related Data Types | 2754 @node Character-Related Data Types |
2754 @subsection Character-Related Data Types | 2755 @subsection Character-Related Data Types |
2755 @cindex character-related data types | 2756 @cindex character-related data types |
2756 @cindex data types, character-related | 2757 @cindex data types, character-related |
2757 | 2758 |
2758 First, let's review the basic character-related datatypes used by | 2759 First, let's review the basic character-related datatypes used by |
2759 XEmacs. Note that the separate @code{typedef}s are not mandatory in the | 2760 XEmacs. Note that some of the separate @code{typedef}s are not |
2760 current implementation (all of them boil down to @code{unsigned char} or | 2761 mandatory, but they improve clarity of code a great deal, because one |
2761 @code{int}), but they improve clarity of code a great deal, because one | |
2762 glance at the declaration can tell the intended use of the variable. | 2762 glance at the declaration can tell the intended use of the variable. |
2763 | 2763 |
2764 @table @code | 2764 @table @code |
2765 @item Emchar | 2765 @item Ichar |
2766 @cindex Emchar | 2766 @cindex Ichar |
2767 An @code{Emchar} holds a single Emacs character. | 2767 An @code{Ichar} holds a single Emacs character. |
2768 | 2768 |
2769 Obviously, the equality between characters and bytes is lost in the Mule | 2769 Obviously, the equality between characters and bytes is lost in the Mule |
2770 world. Characters can be represented by one or more bytes in the | 2770 world. Characters can be represented by one or more bytes in the |
2771 buffer, and @code{Emchar} is the C type large enough to hold any | 2771 buffer, and @code{Ichar} is the C type large enough to hold any |
2772 character. | 2772 character. |
2773 | 2773 |
2774 Without Mule support, an @code{Emchar} is equivalent to an | 2774 Without Mule support, an @code{Ichar} is equivalent to an |
2775 @code{unsigned char}. | 2775 @code{unsigned char}. |
2776 | 2776 |
2777 @item Bufbyte | 2777 @item Ibyte |
2778 @cindex Bufbyte | 2778 @cindex Ibyte |
2779 The data representing the text in a buffer or string is logically a set | 2779 The data representing the text in a buffer or string is logically a set |
2780 of @code{Bufbyte}s. | 2780 of @code{Ibyte}s. |
2781 | 2781 |
2782 XEmacs does not work with the same character formats all the time; when | 2782 XEmacs does not work with the same character formats all the time; when |
2783 reading characters from the outside, it decodes them to an internal | 2783 reading characters from the outside, it decodes them to an internal |
2784 format, and likewise encodes them when writing. @code{Bufbyte} (in fact | 2784 format, and likewise encodes them when writing. @code{Ibyte} (in fact |
2785 @code{unsigned char}) is the basic unit of XEmacs internal buffers and | 2785 @code{unsigned char}) is the basic unit of XEmacs internal buffers and |
2786 strings format. A @code{Bufbyte *} is the type that points at text | 2786 strings format. A @code{Ibyte *} is the type that points at text |
2787 encoded in the variable-width internal encoding. | 2787 encoded in the variable-width internal encoding. |
2788 | 2788 |
2789 One character can correspond to one or more @code{Bufbyte}s. In the | 2789 One character can correspond to one or more @code{Ibyte}s. In the |
2790 current Mule implementation, an ASCII character is represented by the | 2790 current Mule implementation, an ASCII character is represented by the |
2791 same @code{Bufbyte}, and other characters are represented by a sequence | 2791 same @code{Ibyte}, and other characters are represented by a sequence |
2792 of two or more @code{Bufbyte}s. | 2792 of two or more @code{Ibyte}s. |
2793 | 2793 |
2794 Without Mule support, there are exactly 256 characters, implicitly | 2794 Without Mule support, there are exactly 256 characters, implicitly |
2795 Latin-1, and each character is represented using one @code{Bufbyte}, and | 2795 Latin-1, and each character is represented using one @code{Ibyte}, and |
2796 there is a one-to-one correspondence between @code{Bufbyte}s and | 2796 there is a one-to-one correspondence between @code{Ibyte}s and |
2797 @code{Emchar}s. | 2797 @code{Ichar}s. |
2798 | 2798 |
2799 @item Bufpos | 2799 @item Charxpos |
2800 @item Charbpos | |
2800 @itemx Charcount | 2801 @itemx Charcount |
2801 @cindex Bufpos | 2802 @cindex Charxpos |
2803 @cindex Charbpos | |
2802 @cindex Charcount | 2804 @cindex Charcount |
2803 A @code{Bufpos} represents a character position in a buffer or string. | 2805 A @code{Charbpos} represents a character position in a buffer. A |
2804 A @code{Charcount} represents a number (count) of characters. | 2806 @code{Charcount} represents a number (count) of characters. Logically, |
2805 Logically, subtracting two @code{Bufpos} values yields a | 2807 subtracting two @code{Charbpos} values yields a @code{Charcount} value. |
2806 @code{Charcount} value. Although all of these are @code{typedef}ed to | 2808 When representing a character position in a string, we just use |
2809 @code{Charcount} directly. The reason for having a separate typedef for | |
2810 buffer positions is that they are 1-based, whereas string positions are | |
2811 0-based and hence string counts and positions can be freely intermixed (a | |
2812 string position is equivalent to the count of characters from the | |
2813 beginning). When representing a character position that could be either | |
2814 in a buffer or string (for example, in the extent code), @code{Charxpos} | |
2815 is used. Although all of these are @code{typedef}ed to | |
2807 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make | 2816 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make |
2808 it clear what sort of position is being used. | 2817 it clear what sort of position is being used. |
2809 | 2818 |
2810 @code{Bufpos} and @code{Charcount} values are the only ones that are | 2819 @code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the |
2811 ever visible to Lisp. | 2820 only ones that are ever visible to Lisp. |
2812 | 2821 |
2813 @item Bytind | 2822 @item Bytexpos |
2814 @itemx Bytecount | 2823 @itemx Bytecount |
2815 @cindex Bytind | 2824 @cindex Bytebpos |
2816 @cindex Bytecount | 2825 @cindex Bytecount |
2817 A @code{Bytind} represents a byte position in a buffer or string. A | 2826 A @code{Bytebpos} represents a byte position in a buffer. A |
2818 @code{Bytecount} represents the distance between two positions, in bytes. | 2827 @code{Bytecount} represents the distance between two positions, in |
2819 The relationship between @code{Bytind} and @code{Bytecount} is the same | 2828 bytes. Byte positions in strings use @code{Bytecount}, and for byte |
2820 as the relationship between @code{Bufpos} and @code{Charcount}. | 2829 positions that can be either in a buffer or string, @code{Bytexpos} is |
2830 used. The relationship between @code{Bytexpos}, @code{Bytebpos} and | |
2831 @code{Bytecount} is the same as the relationship between | |
2832 @code{Charxpos}, @code{Charbpos} and @code{Charcount}. | |
2821 | 2833 |
2822 @item Extbyte | 2834 @item Extbyte |
2823 @itemx Extcount | |
2824 @cindex Extbyte | 2835 @cindex Extbyte |
2825 @cindex Extcount | |
2826 When dealing with the outside world, XEmacs works with @code{Extbyte}s, | 2836 When dealing with the outside world, XEmacs works with @code{Extbyte}s, |
2827 which are equivalent to @code{unsigned char}. Obviously, an | 2837 which are equivalent to @code{char}. The distance between two |
2828 @code{Extcount} is the distance between two @code{Extbyte}s. Extbytes | 2838 @code{Extbyte}s is a @code{Bytecount}, since external text is a |
2829 and Extcounts are not all that frequent in XEmacs code. | 2839 byte-by-byte encoding. Extbytes occur mainly at the transition point |
2840 between internal text and external functions. XEmacs code should not, | |
2841 if it can possibly avoid it, do any actual manipulation using external | |
2842 text, since its format is completely unpredictable (it might not even be | |
2843 ASCII-compatible). | |
2830 @end table | 2844 @end table |
2831 | 2845 |
2832 @node Working With Character and Byte Positions | 2846 @node Working With Character and Byte Positions |
2833 @subsection Working With Character and Byte Positions | 2847 @subsection Working With Character and Byte Positions |
2834 @cindex character and byte positions, working with | 2848 @cindex character and byte positions, working with |
2841 @file{buffer.h}, and we don't discuss all of them here, but only the | 2855 @file{buffer.h}, and we don't discuss all of them here, but only the |
2842 most important ones. Examining the existing code is the best way to | 2856 most important ones. Examining the existing code is the best way to |
2843 learn about them. | 2857 learn about them. |
2844 | 2858 |
2845 @table @code | 2859 @table @code |
2846 @item MAX_EMCHAR_LEN | 2860 @item MAX_ICHAR_LEN |
2847 @cindex MAX_EMCHAR_LEN | 2861 @cindex MAX_ICHAR_LEN |
2848 This preprocessor constant is the maximum number of buffer bytes to | 2862 This preprocessor constant is the maximum number of buffer bytes to |
2849 represent an Emacs character in the variable width internal encoding. | 2863 represent an Emacs character in the variable width internal encoding. |
2850 It is useful when allocating temporary strings to keep a known number of | 2864 It is useful when allocating temporary strings to keep a known number of |
2851 characters. For instance: | 2865 characters. For instance: |
2852 | 2866 |
2855 @{ | 2869 @{ |
2856 Charcount cclen; | 2870 Charcount cclen; |
2857 ... | 2871 ... |
2858 @{ | 2872 @{ |
2859 /* Allocate place for @var{cclen} characters. */ | 2873 /* Allocate place for @var{cclen} characters. */ |
2860 Bufbyte *buf = (Bufbyte *)alloca (cclen * MAX_EMCHAR_LEN); | 2874 Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN); |
2861 ... | 2875 ... |
2862 @end group | 2876 @end group |
2863 @end example | 2877 @end example |
2864 | 2878 |
2865 If you followed the previous section, you can guess that, logically, | 2879 If you followed the previous section, you can guess that, logically, |
2866 multiplying a @code{Charcount} value with @code{MAX_EMCHAR_LEN} produces | 2880 multiplying a @code{Charcount} value with @code{MAX_ICHAR_LEN} produces |
2867 a @code{Bytecount} value. | 2881 a @code{Bytecount} value. |
2868 | 2882 |
2869 In the current Mule implementation, @code{MAX_EMCHAR_LEN} equals 4. | 2883 In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4. |
2870 Without Mule, it is 1. | 2884 Without Mule, it is 1. |
2871 | 2885 |
2872 @item charptr_emchar | 2886 @item itext_ichar |
2873 @itemx set_charptr_emchar | 2887 @itemx set_itext_ichar |
2874 @cindex charptr_emchar | 2888 @cindex itext_ichar |
2875 @cindex set_charptr_emchar | 2889 @cindex set_itext_ichar |
2876 The @code{charptr_emchar} macro takes a @code{Bufbyte} pointer and | 2890 The @code{itext_ichar} macro takes a @code{Ibyte} pointer and |
2877 returns the @code{Emchar} stored at that position. If it were a | 2891 returns the @code{Ichar} stored at that position. If it were a |
2878 function, its prototype would be: | 2892 function, its prototype would be: |
2879 | 2893 |
2880 @example | 2894 @example |
2881 Emchar charptr_emchar (Bufbyte *p); | 2895 Ichar itext_ichar (Ibyte *p); |
2882 @end example | 2896 @end example |
2883 | 2897 |
2884 @code{set_charptr_emchar} stores an @code{Emchar} to the specified byte | 2898 @code{set_itext_ichar} stores an @code{Ichar} to the specified byte |
2885 position. It returns the number of bytes stored: | 2899 position. It returns the number of bytes stored: |
2886 | 2900 |
2887 @example | 2901 @example |
2888 Bytecount set_charptr_emchar (Bufbyte *p, Emchar c); | 2902 Bytecount set_itext_ichar (Ibyte *p, Ichar c); |
2889 @end example | 2903 @end example |
2890 | 2904 |
2891 It is important to note that @code{set_charptr_emchar} is safe only for | 2905 It is important to note that @code{set_itext_ichar} is safe only for |
2892 appending a character at the end of a buffer, not for overwriting a | 2906 appending a character at the end of a buffer, not for overwriting a |
2893 character in the middle. This is because the width of characters | 2907 character in the middle. This is because the width of characters |
2894 varies, and @code{set_charptr_emchar} cannot resize the string if it | 2908 varies, and @code{set_itext_ichar} cannot resize the string if it |
2895 writes, say, a two-byte character where a single-byte character used to | 2909 writes, say, a two-byte character where a single-byte character used to |
2896 reside. | 2910 reside. |
2897 | 2911 |
2898 A typical use of @code{set_charptr_emchar} can be demonstrated by this | 2912 A typical use of @code{set_itext_ichar} can be demonstrated by this |
2899 example, which copies characters from buffer @var{buf} to a temporary | 2913 example, which copies characters from buffer @var{buf} to a temporary |
2900 string of Bufbytes. | 2914 string of Ibytes. |
2901 | 2915 |
2902 @example | 2916 @example |
2903 @group | 2917 @group |
2904 @{ | 2918 @{ |
2905 Bufpos pos; | 2919 Charbpos pos; |
2906 for (pos = beg; pos < end; pos++) | 2920 for (pos = beg; pos < end; pos++) |
2907 @{ | 2921 @{ |
2908 Emchar c = BUF_FETCH_CHAR (buf, pos); | 2922 Ichar c = BUF_FETCH_CHAR (buf, pos); |
2909 p += set_charptr_emchar (buf, c); | 2923 p += set_itext_ichar (buf, c); |
2910 @} | 2924 @} |
2911 @} | 2925 @} |
2912 @end group | 2926 @end group |
2913 @end example | 2927 @end example |
2914 | 2928 |
2915 Note how @code{set_charptr_emchar} is used to store the @code{Emchar} | 2929 Note how @code{set_itext_ichar} is used to store the @code{Ichar} |
2916 and increment the counter, at the same time. | 2930 and increment the counter, at the same time. |
2917 | 2931 |
2918 @item INC_CHARPTR | 2932 @item INC_IBYTEPTR |
2919 @itemx DEC_CHARPTR | 2933 @itemx DEC_IBYTEPTR |
2920 @cindex INC_CHARPTR | 2934 @cindex INC_IBYTEPTR |
2921 @cindex DEC_CHARPTR | 2935 @cindex DEC_IBYTEPTR |
2922 These two macros increment and decrement a @code{Bufbyte} pointer, | 2936 These two macros increment and decrement an @code{Ibyte} pointer, |
2923 respectively. They will adjust the pointer by the appropriate number of | 2937 respectively. They will adjust the pointer by the appropriate number of |
2924 bytes according to the byte length of the character stored there. Both | 2938 bytes according to the byte length of the character stored there. Both |
2925 macros assume that the memory address is located at the beginning of a | 2939 macros assume that the memory address is located at the beginning of a |
2926 valid character. | 2940 valid character. |
2927 | 2941 |
2928 Without Mule support, @code{INC_CHARPTR (p)} and @code{DEC_CHARPTR (p)} | 2942 Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)} |
2929 simply expand to @code{p++} and @code{p--}, respectively. | 2943 simply expand to @code{p++} and @code{p--}, respectively. |
2930 | 2944 |
2931 @item bytecount_to_charcount | 2945 @item bytecount_to_charcount |
2932 @cindex bytecount_to_charcount | 2946 @cindex bytecount_to_charcount |
2933 Given a pointer to a text string and a length in bytes, return the | 2947 Given a pointer to a text string and a length in bytes, return the |
2934 equivalent length in characters. | 2948 equivalent length in characters. |
2935 | 2949 |
2936 @example | 2950 @example |
2937 Charcount bytecount_to_charcount (Bufbyte *p, Bytecount bc); | 2951 Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc); |
2938 @end example | 2952 @end example |
2939 | 2953 |
2940 @item charcount_to_bytecount | 2954 @item charcount_to_bytecount |
2941 @cindex charcount_to_bytecount | 2955 @cindex charcount_to_bytecount |
2942 Given a pointer to a text string and a length in characters, return the | 2956 Given a pointer to a text string and a length in characters, return the |
2943 equivalent length in bytes. | 2957 equivalent length in bytes. |
2944 | 2958 |
2945 @example | 2959 @example |
2946 Bytecount charcount_to_bytecount (Bufbyte *p, Charcount cc); | 2960 Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc); |
2947 @end example | 2961 @end example |
2948 | 2962 |
2949 @item charptr_n_addr | 2963 @item itext_n_addr |
2950 @cindex charptr_n_addr | 2964 @cindex itext_n_addr |
2951 Return a pointer to the beginning of the character offset @var{cc} (in | 2965 Return a pointer to the beginning of the character offset @var{cc} (in |
2952 characters) from @var{p}. | 2966 characters) from @var{p}. |
2953 | 2967 |
2954 @example | 2968 @example |
2955 Bufbyte *charptr_n_addr (Bufbyte *p, Charcount cc); | 2969 Ibyte *itext_n_addr (Ibyte *p, Charcount cc); |
2956 @end example | 2970 @end example |
2957 @end table | 2971 @end table |
2958 | 2972 |
2959 @node Conversion to and from External Data | 2973 @node Conversion to and from External Data |
2960 @subsection Conversion to and from External Data | 2974 @subsection Conversion to and from External Data |
2961 @cindex conversion to and from external data | 2975 @cindex conversion to and from external data |
2962 @cindex external data, conversion to and from | 2976 @cindex external data, conversion to and from |
2963 | 2977 |
2964 When an external function, such as a C library function, returns a | 2978 When an external function, such as a C library function, returns a |
2965 @code{char} pointer, you should almost never treat it as @code{Bufbyte}. | 2979 @code{char} pointer, you should almost never treat it as @code{Ibyte}. |
2966 This is because these returned strings may contain 8bit characters which | 2980 This is because these returned strings may contain 8bit characters which |
2967 can be misinterpreted by XEmacs, and cause a crash. Likewise, when | 2981 can be misinterpreted by XEmacs, and cause a crash. Likewise, when |
2968 exporting a piece of internal text to the outside world, you should | 2982 exporting a piece of internal text to the outside world, you should |
2969 always convert it to an appropriate external encoding, lest the internal | 2983 always convert it to an appropriate external encoding, lest the internal |
2970 stuff (such as the infamous \201 characters) leak out. | 2984 stuff (such as the infamous \201 characters) leak out. |
2974 @file{buffer.h}. There used to be a fixed set of external formats | 2988 @file{buffer.h}. There used to be a fixed set of external formats |
2975 supported by these macros, but now any coding system can be used with | 2989 supported by these macros, but now any coding system can be used with |
2976 these macros. The coding system alias mechanism is used to create the | 2990 these macros. The coding system alias mechanism is used to create the |
2977 following logical coding systems, which replace the fixed external | 2991 following logical coding systems, which replace the fixed external |
2978 formats. The (dontusethis-set-symbol-value-handler) mechanism was | 2992 formats. The (dontusethis-set-symbol-value-handler) mechanism was |
2979 enhanced to make this possible (more work on that is needed - like | 2993 enhanced to make this possible (more work on that is needed). |
2980 remove the @code{dontusethis-} prefix). | 2994 |
2995 Example useful coding systems: | |
2981 | 2996 |
2982 @table @code | 2997 @table @code |
2983 @item Qbinary | 2998 @item Qbinary |
2984 This is the simplest format and is what we use in the absence of a more | 2999 This is the simplest format and is what we use in the absence of a more |
2985 appropriate format. This converts according to the @code{binary} coding | 3000 appropriate format. This converts according to the @code{binary} coding |
2998 @item | 3013 @item |
2999 On output, characters 0--255 are converted into bytes 0--255 and other | 3014 On output, characters 0--255 are converted into bytes 0--255 and other |
3000 characters are converted into `~'. | 3015 characters are converted into `~'. |
3001 @end enumerate | 3016 @end enumerate |
3002 | 3017 |
3003 @item Qfile_name | |
3004 Format used for filenames. This is user-definable via either the | |
3005 @code{file-name-coding-system} or @code{pathname-coding-system} (now | |
3006 obsolete) variables. | |
3007 | |
3008 @item Qnative | 3018 @item Qnative |
3009 Format used for the external Unix environment---@code{argv[]}, stuff | 3019 Format used for the external Unix environment---@code{argv[]}, stuff |
3010 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc. | 3020 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc. |
3011 Currently this is the same as Qfile_name. The two should be | 3021 This is encoded according to the encoding specified by the current locale. |
3012 distinguished for clarity and possible future separation. | 3022 |
3023 @item Qfile_name | |
3024 Format used for filenames. This is normally the same as @code{Qnative}, | |
3025 but the two should be distinguished for clarity and possible future | |
3026 separation -- and also because @code{Qfile_name} can be changed using either | |
3027 the @code{file-name-coding-system} or @code{pathname-coding-system} (now | |
3028 obsolete) variables. | |
3013 | 3029 |
3014 @item Qctext | 3030 @item Qctext |
3015 Compound--text format. This is the standard X11 format used for data | 3031 Compound-text format. This is the standard X11 format used for data |
3016 stored in properties, selections, and the like. This is an 8-bit | 3032 stored in properties, selections, and the like. This is an 8-bit |
3017 no-lock-shift ISO2022 coding system. This is a real coding system, | 3033 no-lock-shift ISO2022 coding system. This is a real coding system, |
3018 unlike Qfile_name, which is user-definable. | 3034 unlike @code{Qfile_name}, which is user-definable. |
3035 | |
3036 @item Qmswindows_tstr | |
3037 Used for external data in all MS Windows functions that are declared to | |
3038 accept data of type @code{LPTSTR} or @code{LPCSTR}. This maps to either | |
3039 @code{Qmswindows_multibyte} (a locale-specific encoding, same as | |
3040 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether | |
3041 XEmacs is being run under Windows 9X or Windows NT/2000/XP. | |
3019 @end table | 3042 @end table |
3020 | 3043 |
3021 There are two fundamental macros to convert between external and | 3044 There are two fundamental macros to convert between external and |
3022 internal format. | 3045 internal format, as well as various convenience macros to simplify the |
3046 most common operations. | |
3023 | 3047 |
3024 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and | 3048 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and |
3025 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments | 3049 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments |
3026 each of these receives are a source type, a source, a sink type, a sink, | 3050 each of these receives are a source type, a source, a sink type, a sink, |
3027 and a coding system (or a symbol naming a coding system). | 3051 and a coding system (or a symbol naming a coding system). |
3065 @item @code{C_STRING_ALLOCA, ptr,} | 3089 @item @code{C_STRING_ALLOCA, ptr,} |
3066 equivalent to @code{ALLOCA (ptr, len_ignored)} on output. | 3090 equivalent to @code{ALLOCA (ptr, len_ignored)} on output. |
3067 @item @code{C_STRING_MALLOC, ptr,} | 3091 @item @code{C_STRING_MALLOC, ptr,} |
3068 equivalent to @code{MALLOC (ptr, len_ignored)} on output | 3092 equivalent to @code{MALLOC (ptr, len_ignored)} on output |
3069 @item @code{C_STRING, ptr,} | 3093 @item @code{C_STRING, ptr,} |
3070 equivalent to @code{DATA, (ptr, strlen (ptr) + 1)} on input | 3094 equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input |
3071 @item @code{LISP_STRING, string,} | 3095 @item @code{LISP_STRING, string,} |
3072 input or output is a Lisp_Object of type string | 3096 input or output is a Lisp_Object of type string |
3073 @item @code{LISP_BUFFER, buffer,} | 3097 @item @code{LISP_BUFFER, buffer,} |
3074 output is written to @code{(point)} in lisp buffer @var{buffer} | 3098 output is written to @code{(point)} in lisp buffer @var{buffer} |
3075 @item @code{LISP_LSTREAM, lstream,} | 3099 @item @code{LISP_LSTREAM, lstream,} |
3076 input or output is a Lisp_Object of type lstream | 3100 input or output is a Lisp_Object of type lstream |
3077 @item @code{LISP_OPAQUE, object,} | 3101 @item @code{LISP_OPAQUE, object,} |
3078 input or output is a Lisp_Object of type opaque | 3102 input or output is a Lisp_Object of type opaque |
3079 @end table | 3103 @end table |
3080 | 3104 |
3081 Often, the data is being converted to a '\0'-byte-terminated string, | 3105 A source type of @code{C_STRING} or a sink type of |
3082 which is the format required by many external system C APIs. For these | 3106 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where |
3083 purposes, a source type of @code{C_STRING} or a sink type of | 3107 the external API is not '\0'-byte-clean -- i.e. it expects strings to be |
3084 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate. | 3108 terminated with a null byte. For external API's that are in fact |
3085 Otherwise, we should try to keep XEmacs '\0'-byte-clean, which means | 3109 '\0'-byte-clean, we should of course not use these. |
3086 using (ptr, len) pairs. | |
3087 | 3110 |
3088 The sinks to be specified must be lvalues, unless they are the lisp | 3111 The sinks to be specified must be lvalues, unless they are the lisp |
3089 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}. | 3112 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}. |
3113 | |
3114 There is no problem using the same lvalue for source and sink. | |
3115 | |
3116 Garbage collection is inhibited during these conversion operations, so | |
3117 it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}. | |
3090 | 3118 |
3091 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the | 3119 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the |
3092 resulting text is stored in a stack-allocated buffer, which is | 3120 resulting text is stored in a stack-allocated buffer, which is |
3093 automatically freed on returning from the function. However, the sink | 3121 automatically freed on returning from the function. However, the sink |
3094 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed | 3122 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed |
3097 | 3125 |
3098 Note that it doesn't make sense for @code{LISP_STRING} to be a source | 3126 Note that it doesn't make sense for @code{LISP_STRING} to be a source |
3099 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}. | 3127 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}. |
3100 You'll get an assertion failure if you try. | 3128 You'll get an assertion failure if you try. |
3101 | 3129 |
3130 99% of conversions involve raw data or Lisp strings as both source and | |
3131 sink, and usually data is output as @code{alloca()}, or sometimes | |
3132 @code{xmalloc()}. For this reason, convenience macros are defined for | |
3133 many types of conversions involving raw data and/or Lisp strings, | |
3134 especially when the output is an @code{alloca()}ed string. (When the | |
3135 destination is a Lisp string, there are other functions that should be | |
3136 used instead -- @code{build_ext_string()} and @code{make_ext_string()}, | |
3137 for example.) The convenience macros are of two types -- the older kind | |
3138 that store the result into a specified variable, and the newer kind that | |
3139 return the result. The newer kind of macros don't exist when the output | |
3140 is sized data, because that would have two return values. NOTE: All | |
3141 convenience macros are ultimately defined in terms of | |
3142 @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}. Thus, any | |
3143 comments above about the workings of these macros also apply to all | |
3144 convenience macros. | |
3145 | |
3146 A typical old-style convenience macro is | |
3147 | |
3148 @example | |
3149 C_STRING_TO_EXTERNAL (in, out, codesys); | |
3150 @end example | |
3151 | |
3152 This is equivalent to | |
3153 | |
3154 @example | |
3155 TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys); | |
3156 @end example | |
3157 | |
3158 but is easier to write and somewhat clearer, since it clearly identifies | |
3159 the arguments without the clutter of having the preprocessor types mixed | |
3160 in. | |
3161 | |
3162 The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src, | |
3163 codesys)}, which @emph{returns} the converted data (still in | |
3164 @code{alloca()} space). This is far more convenient for most | |
3165 operations. | |
3102 | 3166 |
3103 @node General Guidelines for Writing Mule-Aware Code | 3167 @node General Guidelines for Writing Mule-Aware Code |
3104 @subsection General Guidelines for Writing Mule-Aware Code | 3168 @subsection General Guidelines for Writing Mule-Aware Code |
3105 @cindex writing Mule-aware code, general guidelines for | 3169 @cindex writing Mule-aware code, general guidelines for |
3106 @cindex Mule-aware code, general guidelines for writing | 3170 @cindex Mule-aware code, general guidelines for writing |
3111 | 3175 |
3112 @table @emph | 3176 @table @emph |
3113 @item Never use @code{char} and @code{char *}. | 3177 @item Never use @code{char} and @code{char *}. |
3114 In XEmacs, the use of @code{char} and @code{char *} is almost always a | 3178 In XEmacs, the use of @code{char} and @code{char *} is almost always a |
3115 mistake. If you want to manipulate an Emacs character from ``C'', use | 3179 mistake. If you want to manipulate an Emacs character from ``C'', use |
3116 @code{Emchar}. If you want to examine a specific octet in the internal | 3180 @code{Ichar}. If you want to examine a specific octet in the internal |
3117 format, use @code{Bufbyte}. If you want a Lisp-visible character, use a | 3181 format, use @code{Ibyte}. If you want a Lisp-visible character, use a |
3118 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move | 3182 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move |
3119 through the internal text, use @code{Bufbyte *}. Also note that you | 3183 through the internal text, use @code{Ibyte *}. Also note that you |
3120 almost certainly do not need @code{Emchar *}. | 3184 almost certainly do not need @code{Ichar *}. Other typedefs to clarify |
3121 | 3185 the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary}, |
3122 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, and @code{Bufpos}. | 3186 @code{UChar_Binary}, and @code{CIbyte}. |
3187 | |
3188 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}. | |
3123 The whole point of using different types is to avoid confusion about the | 3189 The whole point of using different types is to avoid confusion about the |
3124 use of certain variables. Lest this effect be nullified, you need to be | 3190 use of certain variables. Lest this effect be nullified, you need to be |
3125 careful about using the right types. | 3191 careful about using the right types. |
3126 | 3192 |
3127 @item Always convert external data | 3193 @item Always convert external data |
3128 It is extremely important to always convert external data, because | 3194 It is extremely important to always convert external data, because |
3129 XEmacs can crash if unexpected 8bit sequences are copied to its internal | 3195 XEmacs can crash if unexpected 8-bit sequences are copied to its internal |
3130 buffers literally. | 3196 buffers literally. |
3131 | 3197 |
3132 This means that when a system function, such as @code{readdir}, returns | 3198 This means that when a system function, such as @code{readdir}, returns |
3133 a string, you may need to convert it using one of the conversion macros | 3199 a string, you may need to convert it using one of the conversion macros |
3134 described in the previous chapter, before passing it further to Lisp. | 3200 described in the previous chapter, before passing it further to Lisp. |
3135 | 3201 |
3136 Actually, most of the basic system functions that accept '\0'-terminated | 3202 Actually, most of the basic system functions that accept '\0'-terminated |
3137 string arguments, like @code{stat()} and @code{open()}, have been | 3203 string arguments, like @code{stat()} and @code{open()}, have |
3138 @strong{encapsulated} so that they are they @code{always} do internal to | 3204 @strong{encapsulated} equivalents that do the internal to external |
3139 external conversion themselves. This means you must pass internally | 3205 conversion themselves. The encapsulated equivalents have a @code{qxe_} |
3140 encoded data, typically the @code{XSTRING_DATA} of a Lisp_String to | 3206 prefix and have string arguments of type @code{Ibyte *}, and you can |
3141 these functions. This is actually a design bug, since it unexpectedly | 3207 pass internally encoded data to them, often from a Lisp string using |
3142 changes the semantics of the system functions. A better design would be | 3208 @code{XSTRING_DATA}. (A better design might be to provide versions that |
3143 to provide separate versions of these system functions that accepted | 3209 accept Lisp strings directly.) |
3144 Lisp_Objects which were lisp strings in place of their current | |
3145 @code{char *} arguments. | |
3146 | |
3147 @example | |
3148 int stat_lisp (Lisp_Object path, struct stat *buf); /* Implement me */ | |
3149 @end example | |
3150 | 3210 |
3151 Also note that many internal functions, such as @code{make_string}, | 3211 Also note that many internal functions, such as @code{make_string}, |
3152 accept Bufbytes, which removes the need for them to convert the data | 3212 accept Ibytes, which removes the need for them to convert the data they |
3153 they receive. This increases efficiency because that way external data | 3213 receive. This increases efficiency because that way external data needs |
3154 needs to be decoded only once, when it is read. After that, it is | 3214 to be decoded only once, when it is read. After that, it is passed |
3155 passed around in internal format. | 3215 around in internal format. |
3216 | |
3217 @item Do all work in internal format | |
3218 External-formatted data is completely unpredictable in its format. It | |
3219 may be Unicode (non-ASCII compatible); it may be a modal encoding, in | |
3220 which case some occurrences of (e.g.) the slash character may be part of | |
3221 two-byte Asian-language characters, and a naive attempt to split apart a | |
3222 pathname by slashes will fail; etc. Internal-format text should be | |
3223 converted to external format only at the point where an external API is | |
3224 actually called, and the first thing done after receiving | |
3225 external-format text from an external API should be to convert it to | |
3226 internal text. | |
3156 @end table | 3227 @end table |
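The hazard described under ``Do all work in internal format'' can be illustrated with a small stand-alone sketch. It is not XEmacs code: the function names and the sample data are hypothetical, and the only modal encoding it understands is a tiny subset of ISO-2022-JP (the escape sequences @code{ESC $ B} and @code{ESC ( B}). The sample contains one two-byte character whose second byte happens to be 0x2F, the same value as an ASCII slash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sample: "a/" ESC $ B <0x30 0x2F> ESC ( B "/b".
   The pair 0x30 0x2F is ONE two-byte character, not '0' followed by '/'. */
static const unsigned char sample_path[] = {
  'a', '/', 0x1B, '$', 'B', 0x30, 0x2F, 0x1B, '(', 'B', '/', 'b'
};

/* Naive approach: count every 0x2F byte, ignoring the encoding. */
static int count_slash_bytes(const unsigned char *s, size_t len)
{
  int n = 0;
  for (size_t i = 0; i < len; i++)
    if (s[i] == 0x2F)
      n++;
  return n;
}

/* Encoding-aware approach for this tiny ISO-2022-JP subset: track
   whether we are in two-byte mode and only count slashes in ASCII mode. */
static int count_slashes_iso2022jp(const unsigned char *s, size_t len)
{
  int two_byte = 0;
  int n = 0;
  size_t i = 0;

  assert(s != NULL);
  while (i < len) {
    if (s[i] == 0x1B && i + 2 < len && s[i + 1] == '$' && s[i + 2] == 'B') {
      two_byte = 1;              /* switch to two-byte mode */
      i += 3;
    } else if (s[i] == 0x1B && i + 2 < len
               && s[i + 1] == '(' && s[i + 2] == 'B') {
      two_byte = 0;              /* switch back to ASCII */
      i += 3;
    } else if (two_byte) {
      i += 2;                    /* skip one two-byte character */
    } else {
      if (s[i] == 0x2F)
        n++;
      i++;
    }
  }
  return n;
}
```

The naive count sees one slash more than the text actually contains, so splitting this pathname on 0x2F bytes would cut a character in half. Converting to internal format first avoids having to write such per-encoding logic at every call site.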
3157 | 3228 |
3158 @node An Example of Mule-Aware Code | 3229 @node An Example of Mule-Aware Code |
3159 @subsection An Example of Mule-Aware Code | 3230 @subsection An Example of Mule-Aware Code |
3160 @cindex code, an example of Mule-aware | 3231 @cindex code, an example of Mule-aware |
3169 DEFUN ("string", Fstring, 0, MANY, 0, /* | 3240 DEFUN ("string", Fstring, 0, MANY, 0, /* |
3170 Concatenate all the argument characters and make the result a string. | 3241 Concatenate all the argument characters and make the result a string. |
3171 */ | 3242 */ |
3172 (int nargs, Lisp_Object *args)) | 3243 (int nargs, Lisp_Object *args)) |
3173 @{ | 3244 @{ |
3174 Bufbyte *storage = alloca_array (Bufbyte, nargs * MAX_EMCHAR_LEN); | 3245 Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN); |
3175 Bufbyte *p = storage; | 3246 Ibyte *p = storage; |
3176 | 3247 |
3177 for (; nargs; nargs--, args++) | 3248 for (; nargs; nargs--, args++) |
3178 @{ | 3249 @{ |
3179 Lisp_Object lisp_char = *args; | 3250 Lisp_Object lisp_char = *args; |
3180 CHECK_CHAR_COERCE_INT (lisp_char); | 3251 CHECK_CHAR_COERCE_INT (lisp_char); |
3181 p += set_charptr_emchar (p, XCHAR (lisp_char)); | 3252 p += set_itext_ichar (p, XCHAR (lisp_char)); |
3182 @} | 3253 @} |
3183 return make_string (storage, p - storage); | 3254 return make_string (storage, p - storage); |
3184 @} | 3255 @} |
3185 @end group | 3256 @end group |
3186 @end example | 3257 @end example |
3187 | 3258 |
3188 Now we can analyze the source line by line. | 3259 Now we can analyze the source line by line. |
3189 | 3260 |
3190 Obviously, the string will contain as many characters as there are arguments to the | 3261 Obviously, the string will contain as many characters as there are arguments to the |
3191 function. This is why we allocate @code{MAX_EMCHAR_LEN} * @var{nargs} | 3262 function. This is why we allocate @code{MAX_ICHAR_LEN} * @var{nargs} |
3192 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs} | 3263 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs} |
3193 @code{Emchar}s to fit in the string. | 3264 @code{Ichar}s to fit in the string. |
3194 | 3265 |
3195 Then, the loop checks that each element is a character, converting | 3266 Then, the loop checks that each element is a character, converting |
3196 integers in the process. Like many other functions in XEmacs, this | 3267 integers in the process. Like many other functions in XEmacs, this |
3197 function silently accepts integers where characters are expected, for | 3268 function silently accepts integers where characters are expected, for |
3198 historical and compatibility reasons. Unless you know what you are | 3269 historical and compatibility reasons. Unless you know what you are |
3199 doing, @code{CHECK_CHAR} will also suffice. @code{XCHAR (lisp_char)} | 3270 doing, @code{CHECK_CHAR} will also suffice. @code{XCHAR (lisp_char)} |
3200 extracts the @code{Emchar} from the @code{Lisp_Object}, and | 3271 extracts the @code{Ichar} from the @code{Lisp_Object}, and |
3201 @code{set_charptr_emchar} stores it to storage, increasing @code{p} in | 3272 @code{set_itext_ichar} stores it to storage, increasing @code{p} in |
3202 the process. | 3273 the process. |
3203 | 3274 |
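The allocation and append pattern analyzed above (reserve the worst-case size, then advance a pointer by the number of bytes each character actually took) can be tried outside XEmacs with the following sketch. It is only an analogue: @code{store_char}, @code{concat_chars} and @code{MAX_ENCODED_LEN} are hypothetical names, UTF-8 is used as a stand-in for the internal text format (which it is not), surrogate code points are not rejected, and @code{malloc()} replaces the stack allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Worst-case encoded length of one character in the stand-in encoding. */
enum { MAX_ENCODED_LEN = 4 };

/* Stand-in for the "store one character, return its byte length" step:
   write code point C at P as UTF-8 and return the bytes written. */
static size_t store_char(unsigned char *p, unsigned long c)
{
  assert(c <= 0x10FFFFUL);
  if (c < 0x80) {
    p[0] = (unsigned char) c;
    return 1;
  }
  if (c < 0x800) {
    p[0] = (unsigned char) (0xC0 | (c >> 6));
    p[1] = (unsigned char) (0x80 | (c & 0x3F));
    return 2;
  }
  if (c < 0x10000UL) {
    p[0] = (unsigned char) (0xE0 | (c >> 12));
    p[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    p[2] = (unsigned char) (0x80 | (c & 0x3F));
    return 3;
  }
  p[0] = (unsigned char) (0xF0 | (c >> 18));
  p[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Concatenate NCHARS code points into a malloc()ed buffer.  The buffer
   is sized for the worst case; *LEN_OUT receives the bytes actually
   used.  Returns NULL if allocation fails; the caller frees the result. */
static unsigned char *concat_chars(const unsigned long *chars, size_t nchars,
                                   size_t *len_out)
{
  unsigned char *storage = malloc(nchars * MAX_ENCODED_LEN + 1);
  unsigned char *p = storage;

  if (storage == NULL)
    return NULL;
  for (size_t i = 0; i < nchars; i++)
    p += store_char(p, chars[i]);   /* advance by the real length */
  *len_out = (size_t) (p - storage);
  return storage;
}
```

As in the original function, the length of the result is recovered as the difference between the moving pointer and the start of the storage, not from the number of arguments.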
3204 Other instructive examples of correct coding under Mule can be found all | 3275 Other instructive examples of correct coding under Mule can be found all |
3205 over the XEmacs code. For starters, I recommend | 3276 over the XEmacs code. For starters, I recommend |
3206 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have | 3277 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have |
3207 understood this section of the manual and studied the examples, you can | 3278 understood this section of the manual and studied the examples, you can |
3208 proceed to write new Mule-aware code. | 3279 proceed to write new Mule-aware code. |
3280 | |
3281 @node Mule-izing Code | |
3282 @subsection Mule-izing Code | |
3283 | |
3284 A lot of code is written without Mule in mind, and needs to be made | |
3285 Mule-correct or ``Mule-ized''. There is really no substitute for | |
3286 line-by-line analysis when doing this, but the following checklist can | |
3287 help: | |
3288 | |
3289 @itemize @bullet | |
3290 @item | |
3291 Check all uses of @code{XSTRING_DATA}. | |
3292 @item | |
3293 Check all uses of @code{build_string} and @code{make_string}. | |
3294 @item | |
3295 Check all uses of @code{tolower} and @code{toupper}. | |
3296 @item | |
3297 Check object print methods. | |
3298 @item | |
3299 Check for use of functions such as @code{write_c_string}, | |
3300 @code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}. | |
3301 @item | |
3302 Check all occurrences of @code{char} and correct to one of the other | |
3303 typedefs described above. | |
3304 @item | |
3305 Check all existing uses of @code{TO_EXTERNAL_FORMAT}, | |
3306 @code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for | |
3307 @samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}). | |
3308 @item | |
3309 In Windows code, string literals may need to be encapsulated with @code{XETEXT}. | |
3310 @end itemize | |
3209 | 3311 |
3210 @node Techniques for XEmacs Developers | 3312 @node Techniques for XEmacs Developers |
3211 @section Techniques for XEmacs Developers | 3313 @section Techniques for XEmacs Developers |
3212 @cindex techniques for XEmacs developers | 3314 @cindex techniques for XEmacs developers |
3213 @cindex developers, techniques for XEmacs | 3315 @cindex developers, techniques for XEmacs |
8009 @menu | 8111 @menu |
8010 * Introduction to Buffers:: A buffer holds a block of text such as a file. | 8112 * Introduction to Buffers:: A buffer holds a block of text such as a file. |
8011 * The Text in a Buffer:: Representation of the text in a buffer. | 8113 * The Text in a Buffer:: Representation of the text in a buffer. |
8012 * Buffer Lists:: Keeping track of all buffers. | 8114 * Buffer Lists:: Keeping track of all buffers. |
8013 * Markers and Extents:: Tagging locations within a buffer. | 8115 * Markers and Extents:: Tagging locations within a buffer. |
8014 * Bufbytes and Emchars:: Representation of individual characters. | 8116 * Ibytes and Ichars:: Representation of individual characters. |
8015 * The Buffer Object:: The Lisp object corresponding to a buffer. | 8117 * The Buffer Object:: The Lisp object corresponding to a buffer. |
8016 @end menu | 8118 @end menu |
8017 | 8119 |
8018 @node Introduction to Buffers | 8120 @node Introduction to Buffers |
8019 @section Introduction to Buffers | 8121 @section Introduction to Buffers |
8085 | 8187 |
8086 For now, we can view a character as some non-negative integer that | 8188 For now, we can view a character as some non-negative integer that |
8087 has some shape that defines how it typically appears (e.g. as an | 8189 has some shape that defines how it typically appears (e.g. as an |
8088 uppercase A). (The exact way in which a character appears depends on the | 8190 uppercase A). (The exact way in which a character appears depends on the |
8089 font used to display the character.) The internal type of characters in | 8191 font used to display the character.) The internal type of characters in |
8090 the C code is an @code{Emchar}; this is just an @code{int}, but using a | 8192 the C code is an @code{Ichar}; this is just an @code{int}, but using a |
8091 symbolic type makes the code clearer. | 8193 symbolic type makes the code clearer. |
8092 | 8194 |
8093 Between every character in a buffer is a @dfn{buffer position} or | 8195 Between every character in a buffer is a @dfn{buffer position} or |
8094 @dfn{character position}. We can speak of the character before or after | 8196 @dfn{character position}. We can speak of the character before or after |
8095 a particular buffer position, and when you insert a character at a | 8197 a particular buffer position, and when you insert a character at a |
8102 Buffer positions are numbered starting at 1. This means that | 8204 Buffer positions are numbered starting at 1. This means that |
8103 position 1 is before the first character, and position 0 is not | 8205 position 1 is before the first character, and position 0 is not |
8104 valid. If there are N characters in a buffer, then buffer | 8206 valid. If there are N characters in a buffer, then buffer |
8105 position N+1 is after the last one, and position N+2 is not valid. | 8207 position N+1 is after the last one, and position N+2 is not valid. |
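The numbering rule for buffer positions can be written down as two small predicates. This is a minimal sketch with hypothetical function names; it models only the rule stated above (positions run from 1 to N+1 for a buffer of N characters) and ignores narrowing and everything else a real buffer tracks.

```c
#include <assert.h>

/* A position is valid if it lies between 1 (before the first character)
   and NCHARS + 1 (after the last character), inclusive. */
static int position_is_valid(long pos, long nchars)
{
  assert(nchars >= 0);
  return pos >= 1 && pos <= nchars + 1;
}

/* There is a character after POS only if POS is not the final position. */
static int has_char_after(long pos, long nchars)
{
  assert(nchars >= 0);
  return pos >= 1 && pos <= nchars;
}
```

Note that an empty buffer still has exactly one valid position, namely 1.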
8106 | 8208 |
8107 The internal makeup of the Emchar integer varies depending on whether | 8209 The internal makeup of the Ichar integer varies depending on whether |
8108 we have compiled with MULE support. If not, the Emchar integer is an | 8210 we have compiled with MULE support. If not, the Ichar integer is an |
8109 8-bit integer with possible values from 0 - 255. 0 - 127 are the | 8211 8-bit integer with possible values from 0 - 255. 0 - 127 are the |
8110 standard ASCII characters, while 128 - 255 are the characters from the | 8212 standard ASCII characters, while 128 - 255 are the characters from the |
8111 ISO-8859-1 character set. If we have compiled with MULE support, an | 8213 ISO-8859-1 character set. If we have compiled with MULE support, an |
8112 Emchar is a 19-bit integer, with the various bits having meanings | 8214 Ichar is a 19-bit integer, with the various bits having meanings |
8113 according to a complex scheme that will be detailed later. The | 8215 according to a complex scheme that will be detailed later. The |
8114 characters numbered 0 - 255 still have the same meanings as for the | 8216 characters numbered 0 - 255 still have the same meanings as for the |
8115 non-MULE case, though. | 8217 non-MULE case, though. |
8116 | 8218 |
8117 Internally, the text in a buffer is represented in a fairly simple | 8219 Internally, the text in a buffer is represented in a fairly simple |
8146 the situation is different. In this case, the space @emph{will} be | 8248 the situation is different. In this case, the space @emph{will} be |
8147 released back to the operating system. However, this tends to result in a | 8249 released back to the operating system. However, this tends to result in a |
8148 noticeable speed penalty.) | 8250 noticeable speed penalty.) |
8149 | 8251 |
8150 Astute readers may notice that the text in a buffer is represented as | 8252 Astute readers may notice that the text in a buffer is represented as |
8151 an array of @emph{bytes}, while (at least in the MULE case) an Emchar is | 8253 an array of @emph{bytes}, while (at least in the MULE case) an Ichar is |
8152 a 19-bit integer, which clearly cannot fit in a byte. This means (of | 8254 a 19-bit integer, which clearly cannot fit in a byte. This means (of |
8153 course) that the text in a buffer uses a different representation from | 8255 course) that the text in a buffer uses a different representation from |
8154 an Emchar: specifically, the 19-bit Emchar becomes a series of one to | 8256 an Ichar: specifically, the 19-bit Ichar becomes a series of one to |
8155 four bytes. The conversion between these two representations is complex | 8257 four bytes. The conversion between these two representations is complex |
8156 and will be described later. | 8258 and will be described later. |
8157 | 8259 |
8158 In the non-MULE case, everything is very simple: An Emchar | 8260 In the non-MULE case, everything is very simple: An Ichar |
8159 is an 8-bit value, which fits neatly into one byte. | 8261 is an 8-bit value, which fits neatly into one byte. |
8160 | 8262 |
8161 If we are given a buffer position and want to retrieve the | 8263 If we are given a buffer position and want to retrieve the |
8162 character at that position, we need to follow these steps: | 8264 character at that position, we need to follow these steps: |
8163 | 8265 |
8178 position that is @dfn{at} the gap, we always use the memory position at | 8280 position that is @dfn{at} the gap, we always use the memory position at |
8179 the @emph{beginning}, not at the end, of the gap. | 8281 the @emph{beginning}, not at the end, of the gap. |
8180 @item | 8282 @item |
8181 Fetch the appropriate bytes at the determined memory position. | 8283 Fetch the appropriate bytes at the determined memory position. |
8182 @item | 8284 @item |
8183 Convert these bytes into an Emchar. | 8285 Convert these bytes into an Ichar. |
8184 @end enumerate | 8286 @end enumerate |
8185 | 8287 |
8186 In the non-Mule case, (3) and (4) boil down to a simple one-byte | 8288 In the non-Mule case, (3) and (4) boil down to a simple one-byte |
8187 memory access. | 8289 memory access. |
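The lookup steps can be sketched for the simple one-byte-per-character case as follows. This is an illustrative model, not the code in XEmacs: the struct, its field names, and the function names are assumptions made for the example. Positions are 1-based as described above, and @code{gpt} is the position at which the gap sits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of buffer text with a gap. */
typedef struct {
  unsigned char *storage; /* text before gap, then gap, then text after */
  size_t gpt;             /* 1-based position where the gap begins */
  size_t gap_size;        /* number of unused bytes in the gap */
  size_t text_len;        /* bytes of real text, not counting the gap */
} gap_buffer;

/* Convert a byte position to a memory position.  A position exactly at
   the gap maps to the beginning of the gap, so only positions strictly
   after the gap are shifted. */
static size_t pos_to_mem(const gap_buffer *b, size_t pos)
{
  assert(pos >= 1 && pos <= b->text_len + 1);
  return pos > b->gpt ? pos + b->gap_size : pos;
}

/* Fetch the byte that follows position POS.  The byte after the
   position at the gap lives just past the END of the gap. */
static unsigned char byte_after(const gap_buffer *b, size_t pos)
{
  size_t offset;

  assert(pos >= 1 && pos <= b->text_len);
  offset = pos - 1;                 /* zero-based offset ignoring the gap */
  if (pos >= b->gpt)
    offset += b->gap_size;          /* skip over the gap */
  return b->storage[offset];
}
```

In the Mule case the fetch step would read one to four bytes starting at that offset and then decode them; the gap arithmetic itself stays the same.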
8188 | 8290 |
8189 Note that we have defined three types of positions in a buffer: | 8291 Note that we have defined three types of positions in a buffer: |
8190 | 8292 |
8191 @enumerate | 8293 @enumerate |
8192 @item | 8294 @item |
8193 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Bufpos} | 8295 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos} |
8194 @item | 8296 @item |
8195 @dfn{byte indices}, typedef @code{Bytind} | 8297 @dfn{byte indices}, typedef @code{Bytebpos} |
8196 @item | 8298 @item |
8197 @dfn{memory indices}, typedef @code{Memind} | 8299 @dfn{memory indices}, typedef @code{Membpos} |
8198 @end enumerate | 8300 @end enumerate |
8199 | 8301 |
8200 All three typedefs are just @code{int}s, but defining them this way makes | 8302 All three typedefs are just @code{int}s, but defining them this way makes |
8201 things a lot clearer. | 8303 things a lot clearer. |
8202 | 8304 |
8203 Most code works with buffer positions. In particular, all Lisp code | 8305 Most code works with buffer positions. In particular, all Lisp code |
8204 that refers to text in a buffer uses buffer positions. Lisp code does | 8306 that refers to text in a buffer uses buffer positions. Lisp code does |
8205 not know that byte indices or memory indices exist. | 8307 not know that byte indices or memory indices exist. |
8206 | 8308 |
8207 Finally, we have a typedef for the bytes in a buffer. This is a | 8309 Finally, we have a typedef for the bytes in a buffer. This is a |
8208 @code{Bufbyte}, which is an unsigned char. Referring to them as | 8310 @code{Ibyte}, which is an unsigned char. Referring to them as |
8209 Bufbytes underscores the fact that we are working with a string of bytes | 8311 Ibytes underscores the fact that we are working with a string of bytes |
8210 in the internal Emacs buffer representation rather than in one of a | 8312 in the internal Emacs buffer representation rather than in one of a |
8211 number of possible alternative representations (e.g. EUC-encoded text, | 8313 number of possible alternative representations (e.g. EUC-encoded text, |
8212 etc.). | 8314 etc.). |
8213 | 8315 |
8214 @node Buffer Lists | 8316 @node Buffer Lists |
8274 | 8376 |
8275 The important thing here is that markers and extents simply contain | 8377 The important thing here is that markers and extents simply contain |
8276 buffer positions in them as integers, and every time text is inserted or | 8378 buffer positions in them as integers, and every time text is inserted or |
8277 deleted, these positions must be updated. In order to minimize the | 8379 deleted, these positions must be updated. In order to minimize the |
8278 amount of shuffling that needs to be done, the positions in markers and | 8380 amount of shuffling that needs to be done, the positions in markers and |
8279 extents (there's one per marker, two per extent) are stored in Meminds. | 8381 extents (there's one per marker, two per extent) are stored in Membpos's. |
8280 This means that they only need to be moved when the text is physically | 8382 This means that they only need to be moved when the text is physically |
8281 moved in memory; since the gap structure tries to minimize this, it also | 8383 moved in memory; since the gap structure tries to minimize this, it also |
8282 minimizes the number of marker and extent indices that need to be | 8384 minimizes the number of marker and extent indices that need to be |
8283 adjusted. Look in @file{insdel.c} for the details of how this works. | 8385 adjusted. Look in @file{insdel.c} for the details of how this works. |
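The benefit of storing memory positions can be seen in a small model. The sketch below is not taken from @file{insdel.c}; the struct and function names are hypothetical. It shows that inserting at the gap changes only the gap bookkeeping, while a marker's stored memory position stays untouched and still translates to the correct, shifted buffer position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gap bookkeeping: 1-based gap position and gap size. */
typedef struct {
  size_t gpt;
  size_t gap_size;
} gap_state;

/* Inserting N bytes at the gap consumes the front of the gap.
   No stored memory position needs to change. */
static void insert_at_gap(gap_state *g, size_t n)
{
  assert(n <= g->gap_size);
  g->gpt += n;
  g->gap_size -= n;
}

/* Translate a stored memory position back to a byte position. */
static size_t mem_to_pos(const gap_state *g, size_t mem)
{
  return mem > g->gpt ? mem - g->gap_size : mem;
}
```

Had the marker stored a buffer position instead, every marker after the insertion point would need to be rewritten on each insertion; here the adjustment is deferred until the gap itself is moved.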
8284 | 8386 |
8288 is no way to determine what markers are in a buffer if you are just | 8390 is no way to determine what markers are in a buffer if you are just |
8289 given the buffer. Extents remain in a buffer until they are detached | 8391 given the buffer. Extents remain in a buffer until they are detached |
8290 (which could happen as a result of text being deleted) or the buffer is | 8392 (which could happen as a result of text being deleted) or the buffer is |
8291 deleted, and primitives do exist to enumerate the extents in a buffer. | 8393 deleted, and primitives do exist to enumerate the extents in a buffer. |
8292 | 8394 |
8293 @node Bufbytes and Emchars | 8395 @node Ibytes and Ichars |
8294 @section Bufbytes and Emchars | 8396 @section Ibytes and Ichars |
8295 @cindex Bufbytes and Emchars | 8397 @cindex Ibytes and Ichars |
8296 @cindex Emchars, Bufbytes and | 8398 @cindex Ichars, Ibytes and |
8297 | 8399 |
8298 Not yet documented. | 8400 Not yet documented. |
8299 | 8401 |
8300 @node The Buffer Object | 8402 @node The Buffer Object |
8301 @section The Buffer Object | 8403 @section The Buffer Object |
8402 @cindex character sets and encodings, Mule | 8504 @cindex character sets and encodings, Mule |
8403 @cindex encodings, Mule character sets and | 8505 @cindex encodings, Mule character sets and |
8404 | 8506 |
8405 Recall that there are two primary ways that text is represented in | 8507 Recall that there are two primary ways that text is represented in |
8406 XEmacs. The @dfn{buffer} representation sees the text as a series of | 8508 XEmacs. The @dfn{buffer} representation sees the text as a series of |
8407 bytes (Bufbytes), with a variable number of bytes used per character. | 8509 bytes (Ibytes), with a variable number of bytes used per character. |
8408 The @dfn{character} representation sees the text as a series of integers | 8510 The @dfn{character} representation sees the text as a series of integers |
8409 (Emchars), one per character. The character representation is a cleaner | 8511 (Ichars), one per character. The character representation is a cleaner |
8410 representation from a theoretical standpoint, and is thus used in many | 8512 representation from a theoretical standpoint, and is thus used in many |
8411 cases when lots of manipulations on a string need to be done. However, | 8513 cases when lots of manipulations on a string need to be done. However, |
8412 the buffer representation is the standard representation used in both | 8514 the buffer representation is the standard representation used in both |
8413 Lisp strings and buffers, and because of this, it is the ``default'' | 8515 Lisp strings and buffers, and because of this, it is the ``default'' |
8414 representation that text comes in. The reason for using this | 8516 representation that text comes in. The reason for using this |
9037 @deftypefunx int Lstream_fgetc (Lstream *@var{stream}) | 9139 @deftypefunx int Lstream_fgetc (Lstream *@var{stream}) |
9038 @deftypefunx void Lstream_fungetc (Lstream *@var{stream}, int @var{c}) | 9140 @deftypefunx void Lstream_fungetc (Lstream *@var{stream}, int @var{c}) |
9039 Function equivalents of the above macros. | 9141 Function equivalents of the above macros. |
9040 @end deftypefun | 9142 @end deftypefun |
9041 | 9143 |
9042 @deftypefun ssize_t Lstream_read (Lstream *@var{stream}, void *@var{data}, size_t @var{size}) | 9144 @deftypefun Bytecount Lstream_read (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size}) |
9043 Read @var{size} bytes of @var{data} from the stream. Return the number | 9145 Read @var{size} bytes of @var{data} from the stream. Return the number |
9044 of bytes read. 0 means EOF. -1 means an error occurred and no bytes | 9146 of bytes read. 0 means EOF. -1 means an error occurred and no bytes |
9045 were read. | 9147 were read. |
9046 @end deftypefun | 9148 @end deftypefun |
9047 | 9149 |
9048 @deftypefun ssize_t Lstream_write (Lstream *@var{stream}, void *@var{data}, size_t @var{size}) | 9150 @deftypefun Bytecount Lstream_write (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size}) |
9049 Write @var{size} bytes of @var{data} to the stream. Return the number | 9151 Write @var{size} bytes of @var{data} to the stream. Return the number |
9050 of bytes written. -1 means an error occurred and no bytes were written. | 9152 of bytes written. -1 means an error occurred and no bytes were written. |
9051 @end deftypefun | 9153 @end deftypefun |
9052 | 9154 |
9053 @deftypefun void Lstream_unread (Lstream *@var{stream}, void *@var{data}, size_t @var{size}) | 9155 @deftypefun void Lstream_unread (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size}) |
9054 Push back @var{size} bytes of @var{data} onto the input queue. The next | 9156 Push back @var{size} bytes of @var{data} onto the input queue. The next |
9055 call to @code{Lstream_read()} with the same size will read the same | 9157 call to @code{Lstream_read()} with the same size will read the same |
9056 bytes back. Note that this will be the case even if there is other | 9158 bytes back. Note that this will be the case even if there is other |
9057 pending unread data. | 9159 pending unread data. |
9058 @end deftypefun | 9160 @end deftypefun |
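The push-back behavior described for @code{Lstream_unread()} (the most recently unread bytes come back first, ahead of older unread data and ahead of the underlying source) can be modeled with a small stack. This is a sketch under stated assumptions, not the Lstream implementation: @code{mini_stream}, @code{mini_unread}, @code{mini_read} and the fixed @code{UNREAD_MAX} limit are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { UNREAD_MAX = 64 };

/* Hypothetical in-memory stream with a push-back area. */
typedef struct {
  const unsigned char *src;           /* underlying data source */
  size_t src_len;
  size_t src_pos;
  unsigned char unread[UNREAD_MAX];   /* stored reversed: the last
                                         element is the next byte read */
  size_t unread_len;
} mini_stream;

/* Push SIZE bytes back so that they are the next bytes read, in order. */
static void mini_unread(mini_stream *s, const unsigned char *data, size_t size)
{
  assert(size <= UNREAD_MAX - s->unread_len);
  for (size_t i = size; i > 0; i--)
    s->unread[s->unread_len++] = data[i - 1];
}

/* Read up to SIZE bytes: pushed-back data first, then the source.
   Returns the number of bytes stored in DATA. */
static size_t mini_read(mini_stream *s, unsigned char *data, size_t size)
{
  size_t n = 0;

  while (n < size && s->unread_len > 0)
    data[n++] = s->unread[--s->unread_len];
  while (n < size && s->src_pos < s->src_len)
    data[n++] = s->src[s->src_pos++];
  return n;
}
```

Storing the pushed-back bytes in reverse makes both operations simple appends and pops at the end of one array, and it gives the last-in, first-out ordering between separate unread calls for free.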
9074 | 9176 |
9075 @node Lstream Methods | 9177 @node Lstream Methods |
9076 @section Lstream Methods | 9178 @section Lstream Methods |
9077 @cindex lstream methods | 9179 @cindex lstream methods |
9078 | 9180 |
9079 @deftypefn {Lstream Method} ssize_t reader (Lstream *@var{stream}, unsigned char *@var{data}, size_t @var{size}) | 9181 @deftypefn {Lstream Method} Bytecount reader (Lstream *@var{stream}, unsigned char *@var{data}, Bytecount @var{size}) |
9080 Read some data from the stream's end and store it into @var{data}, which | 9182 Read some data from the stream's end and store it into @var{data}, which |
9081 can hold @var{size} bytes. Return the number of bytes read. A return | 9183 can hold @var{size} bytes. Return the number of bytes read. A return |
9082 value of 0 means no bytes can be read at this time. This may be because | 9184 value of 0 means no bytes can be read at this time. This may be because |
9083 of an EOF, or because there is a granularity greater than one byte that | 9185 of an EOF, or because there is a granularity greater than one byte that |
9084 the stream imposes on the returned data, and @var{size} is less than | 9186 the stream imposes on the returned data, and @var{size} is less than |
9091 calls @code{Lstream_read()} with a very small size. | 9193 calls @code{Lstream_read()} with a very small size. |
9092 | 9194 |
9093 This function can be @code{NULL} if the stream is output-only. | 9195 This function can be @code{NULL} if the stream is output-only. |
9094 @end deftypefn | 9196 @end deftypefn |
9095 | 9197 |
9096 @deftypefn {Lstream Method} ssize_t writer (Lstream *@var{stream}, const unsigned char *@var{data}, size_t @var{size}) | 9198 @deftypefn {Lstream Method} Bytecount writer (Lstream *@var{stream}, const unsigned char *@var{data}, Bytecount @var{size}) |
9097 Send some data to the stream's end. Data to be sent is in @var{data} | 9199 Send some data to the stream's end. Data to be sent is in @var{data} |
9098 and is @var{size} bytes. Return the number of bytes sent. This | 9200 and is @var{size} bytes. Return the number of bytes sent. This |
9099 function can send and return fewer bytes than is passed in; in that | 9201 function can send and return fewer bytes than is passed in; in that |
9100 case, the function will just be called again until there is no data left | 9202 case, the function will just be called again until there is no data left |
9101 or 0 is returned. A return value of 0 means that no more data can be | 9203 or 0 is returned. A return value of 0 means that no more data can be |
9692 Similarly, a string may or may not have an extent_info structure. | 9794 Similarly, a string may or may not have an extent_info structure. |
9693 (Generally it won't if there haven't been any extents added to the | 9795 (Generally it won't if there haven't been any extents added to the |
9694 string.) So use the @code{_force} version if you need the extent_info | 9796 string.) So use the @code{_force} version if you need the extent_info |
9695 structure to be there. | 9797 structure to be there. |
9696 | 9798 |
9697 A list of extents is maintained as a double gap array: one gap array | 9799 A list of extents is maintained as a double gap array: One gap array |
9698 is ordered by start index (the @dfn{display order}) and the other is | 9800 is ordered by start index (the @dfn{display order}) and the other is |
9699 ordered by end index (the @dfn{e-order}). Note that positions in an | 9801 ordered by end index (the @dfn{e-order}). Note that positions in an |
9700 extent list should logically be conceived of as referring @emph{to} a | 9802 extent list should logically be conceived of as referring @emph{to} a |
9701 particular extent (as is the norm in programs) rather than sitting | 9803 particular extent (as is the norm in programs) rather than sitting |
9702 between two extents. Note also that callers of these functions should | 9804 between two extents. Note also that callers of these functions should |
9703 not be aware of the fact that the extent list is implemented as an | 9805 not be aware of the fact that the extent list is implemented as an |
9704 array, except for the fact that positions are integers (this should be | 9806 array, except for the fact that positions are integers (this should be |
9705 generalized to handle integers and linked lists equally well). | 9807 generalized to handle integers and linked lists equally well). |
9808 | |
9809 A gap array is the same structure used by buffer text: an array of | |
9810 elements with a "gap" somewhere in the middle. Insertion and deletion | |
9811 happen by moving the gap to the insertion/deletion point, and then | |
9812 expanding/contracting as necessary. Gap arrays have a number of | |
9813 useful properties: | |
9814 | |
9815 @enumerate | |
9816 @item | |
9817 They are space efficient, as there is no need for next/previous pointers. | |
9818 | |
9819 @item | |
9820 If the items in them are sorted, locating an item is fast -- @math{O(log N)}. | |
9821 | |
9822 @item | |
9823 Insertion and deletion are very fast (constant time, essentially) if the | |
9824 gap is near (which favors localized operations, as will usually be the | |
9825 case). Even if not, it requires only a block move of memory, which is | |
9826 generally a highly optimized operation on modern processors. | |
9827 | |
9828 @item | |
9829 Code to manipulate them is relatively simple to write. | |
9830 @end enumerate | |
9831 | |
9832 An alternative would be balanced binary trees, which have guaranteed | |
9833 @math{O(log N)} time for all operations (although the constant factors | |
9834 are not as good, and repeated localized operations will be slower than | |
9835 for a gap array). Such code is quite tricky to write, however. | |
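The gap-array behavior described above can be sketched in C roughly as follows. This is a minimal illustration only, not the XEmacs implementation: the type @code{gap_array} and the functions @code{ga_init}, @code{ga_move_gap}, @code{ga_grow}, @code{ga_insert}, @code{ga_delete} and @code{ga_get} are hypothetical names, and the elements are plain @code{int}s rather than extents.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical gap array of ints; all names here are illustrative only. */
typedef struct {
  int *data;        /* physical storage: elements, gap, more elements */
  size_t len;       /* number of live elements */
  size_t gap_start; /* physical (and logical) index where the gap begins */
  size_t gap_size;  /* number of unused slots in the gap */
} gap_array;

static void ga_init(gap_array *ga, size_t cap) {
  ga->data = malloc((cap ? cap : 1) * sizeof(int));
  if (!ga->data) abort();
  ga->len = 0;
  ga->gap_start = 0;
  ga->gap_size = cap ? cap : 1;
}

/* Move the gap so that it begins at logical position pos.
   This is the single block move mentioned in the text. */
static void ga_move_gap(gap_array *ga, size_t pos) {
  if (pos < ga->gap_start)
    memmove(ga->data + pos + ga->gap_size, ga->data + pos,
            (ga->gap_start - pos) * sizeof(int));
  else if (pos > ga->gap_start)
    memmove(ga->data + ga->gap_start,
            ga->data + ga->gap_start + ga->gap_size,
            (pos - ga->gap_start) * sizeof(int));
  ga->gap_start = pos;
}

/* Double the capacity; the gap is first moved to the end so that
   the newly allocated space simply extends it. */
static void ga_grow(gap_array *ga) {
  size_t new_cap = (ga->len + ga->gap_size) * 2;
  ga_move_gap(ga, ga->len);
  ga->data = realloc(ga->data, new_cap * sizeof(int));
  if (!ga->data) abort();
  ga->gap_size = new_cap - ga->len;
}

/* Insert value before logical position pos. */
static void ga_insert(gap_array *ga, size_t pos, int value) {
  if (ga->gap_size == 0) ga_grow(ga);
  ga_move_gap(ga, pos);
  ga->data[ga->gap_start++] = value; /* fill the first gap slot */
  ga->gap_size--;
  ga->len++;
}

/* Delete the element at logical position pos by letting the gap absorb it. */
static void ga_delete(gap_array *ga, size_t pos) {
  ga_move_gap(ga, pos);
  ga->gap_size++;
  ga->len--;
}

/* Translate a logical index into a physical one, skipping the gap. */
static int ga_get(const gap_array *ga, size_t i) {
  return ga->data[i < ga->gap_start ? i : i + ga->gap_size];
}
```

In this sketch, repeated insertions or deletions at the same position never call @code{memmove} with a nonzero length, which is the sense in which localized operations are essentially constant time.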
9706 | 9836 |
9707 @node Zero-Length Extents | 9837 @node Zero-Length Extents |
9708 @section Zero-Length Extents | 9838 @section Zero-Length Extents |
9709 @cindex zero-length extents | 9839 @cindex zero-length extents |
9710 @cindex extents, zero-length | 9840 @cindex extents, zero-length |
9829 This is the analog of Theorem 1, and applies because the e-order | 9959 This is the analog of Theorem 1, and applies because the e-order |
9830 sorts by increasing ending index. | 9960 sorts by increasing ending index. |
9831 | 9961 |
9832 Therefore, @math{F} can be found in the same amount of time as | 9962 Therefore, @math{F} can be found in the same amount of time as |
9833 operation (1), i.e. the time that it takes to locate where an extent | 9963 operation (1), i.e. the time that it takes to locate where an extent |
9834 would go if inserted into the e-order list. | 9964 would go if inserted into the e-order list. This is @math{O(log N)}, |
9835 | 9965 since we are using gap arrays to manage extents. |
9836 If the lists were stored as balanced binary trees, then operation (1) | |
9837 would take logarithmic time, which is usually quite fast. However, | |
9838 currently they're stored as simple doubly-linked lists, and instead we | |
9839 do some caching to try to speed things up. | |
9840 | 9966 |
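To make the @math{O(log N)} claim concrete, here is a hedged sketch of a binary search over a sorted gap array, which finds where a given ending index would go in an e-order list. The type @code{sorted_gap_array} and the functions @code{sga_get} and @code{sga_locate} are hypothetical names, and the array holds bare ending indices rather than real extent objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-only view of a gap array of ending indices,
   sorted in ascending order (a stand-in for the e-order list). */
typedef struct {
  const int *ends;  /* physical storage, including the unused gap */
  size_t len;       /* number of live elements */
  size_t gap_start; /* index where the gap begins */
  size_t gap_size;  /* number of unused slots in the gap */
} sorted_gap_array;

/* Fetch the element at logical index i, skipping over the gap. */
static int sga_get(const sorted_gap_array *a, size_t i) {
  return a->ends[i < a->gap_start ? i : i + a->gap_size];
}

/* Return the first logical position whose ending index is >= end,
   i.e. where an element with that ending index would be inserted.
   Each step halves the search range, so this takes O(log N) probes. */
static size_t sga_locate(const sorted_gap_array *a, int end) {
  size_t lo = 0, hi = a->len;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (sga_get(a, mid) < end)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo;
}
```

The gap costs only one comparison per probe in @code{sga_get}, so the search is as fast as over a plain sorted array.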
9841 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents | 9967 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents |
9842 (ordered in the display order) that overlap an index @math{I}, together | 9968 (ordered in display order and e-order, just like for normal extent |
9843 with the SOE's @dfn{previous} extent, which is an extent that precedes | 9969 lists) that overlap an index @math{I}. |
9844 @math{I} in the e-order. (Hopefully there will not be very many extents | |
9845 between @math{I} and the previous extent.) | |
9846 | 9970 |
9847 Now: | 9971 Now: |
9848 | 9972 |
9849 Let @math{I} be an index, let @math{S} be the stack of extents on | 9973 Let @math{I} be an index, let @math{S} be the stack of extents on |
9850 @math{I}, let @math{F} be the first extent in @math{S}, and let @math{P} | 9974 @math{I} and let @math{F} be the first extent in @math{S}. |
9851 be @math{S}'s previous extent. | |
9852 | 9975 |
9853 Theorem 3: The first extent in @math{S} is the first extent that overlaps | 9976 Theorem 3: The first extent in @math{S} is the first extent that overlaps |
9854 any range @math{[I, J]}. | 9977 any range @math{[I, J]}. |
9855 | 9978 |
9856 Proof: Any extent that overlaps @math{[I, J]} but does not include | 9979 Proof: Any extent that overlaps @math{[I, J]} but does not include |