comparison man/internals/internals.texi @ 1261:465bd3c7d932

[xemacs-hg @ 2003-02-06 06:35:47 by ben] various bug fixes

mule/cyril-util.el: Fix compile warning.

loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set stack-trace-on-error, load-always-display-messages so we get better debug results.

update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure.

simple.el: Omit M-S-home/end from motion keys.

update-elc.el: Overhaul:
-- allow list of "early-compile" files to be specified, not hardcoded
-- fix autoload checking to include all .el files, not just dumped ones
-- be smarter about regenerating autoloads, so we don't need to use loadup-el if not necessary
-- use standard methods for loading/not loading auto-autoloads.el (maybe fixes "Already loaded" error?)
-- rename misleading NOBYTECOMPILE flag file.

window-xemacs.el: Fix bug in default param.

window-xemacs.el: Fix compile warnings.

lwlib-Xm.c: Fix compile warning.

lispref/mule.texi: Lots of Mule rewriting.

internals/internals.texi: Major fixup. Correct for new names of Bytebpos, Ichar, etc. and lots of Mule rewriting.

config.inc.samp: Various fixups.

Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE.

esd.c: Warning fixes.

fns.c: Eliminate bogus require-prints-loading-message; use already existent load-always-display-messages instead. Make sure `load' knows we are coming from `require'.

lread.c: Turn on `load-warn-when-source-newer' by default. Change loading message to indicate when we are `require'ing. Eliminate purify_flag hacks to display more messages; instead, loadup and friends specify this explicitly with `load-always-display-messages'. Add spaces when batch to clearly indicate recursive loading. Fassoc() does not GC so no need to gcpro.

gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar items due to lack of GCPROing of callbacks in lwlib structures.

eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when noninteractive, or backtraces get all screwed up as some values are printed through the stream console and some aren't. Export canonicalize_printcharfun() and use in Fbacktrace().
author ben
date Thu, 06 Feb 2003 06:36:17 +0000
parents c1553814932e
children bada4b0bce3a
1260:278c9cd3435e 1261:465bd3c7d932
265 265
266 * Introduction to Buffers:: A buffer holds a block of text such as a file. 266 * Introduction to Buffers:: A buffer holds a block of text such as a file.
267 * The Text in a Buffer:: Representation of the text in a buffer. 267 * The Text in a Buffer:: Representation of the text in a buffer.
268 * Buffer Lists:: Keeping track of all buffers. 268 * Buffer Lists:: Keeping track of all buffers.
269 * Markers and Extents:: Tagging locations within a buffer. 269 * Markers and Extents:: Tagging locations within a buffer.
270 * Bufbytes and Emchars:: Representation of individual characters. 270 * Ibytes and Ichars:: Representation of individual characters.
271 * The Buffer Object:: The Lisp object corresponding to a buffer. 271 * The Buffer Object:: The Lisp object corresponding to a buffer.
272 272
273 MULE Character Sets and Encodings 273 MULE Character Sets and Encodings
274 274
275 * Character Sets:: 275 * Character Sets::
2746 * Character-Related Data Types:: 2746 * Character-Related Data Types::
2747 * Working With Character and Byte Positions:: 2747 * Working With Character and Byte Positions::
2748 * Conversion to and from External Data:: 2748 * Conversion to and from External Data::
2749 * General Guidelines for Writing Mule-Aware Code:: 2749 * General Guidelines for Writing Mule-Aware Code::
2750 * An Example of Mule-Aware Code:: 2750 * An Example of Mule-Aware Code::
2751 * Mule-izing Code::
2751 @end menu 2752 @end menu
2752 2753
2753 @node Character-Related Data Types 2754 @node Character-Related Data Types
2754 @subsection Character-Related Data Types 2755 @subsection Character-Related Data Types
2755 @cindex character-related data types 2756 @cindex character-related data types
2756 @cindex data types, character-related 2757 @cindex data types, character-related
2757 2758
2758 First, let's review the basic character-related datatypes used by 2759 First, let's review the basic character-related datatypes used by
2759 XEmacs. Note that the separate @code{typedef}s are not mandatory in the 2760 XEmacs. Note that some of the separate @code{typedef}s are not
2760 current implementation (all of them boil down to @code{unsigned char} or 2761 mandatory, but they improve clarity of code a great deal, because one
2761 @code{int}), but they improve clarity of code a great deal, because one
2762 glance at the declaration can tell the intended use of the variable. 2762 glance at the declaration can tell the intended use of the variable.
2763 2763
2764 @table @code 2764 @table @code
2765 @item Emchar 2765 @item Ichar
2766 @cindex Emchar 2766 @cindex Ichar
2767 An @code{Emchar} holds a single Emacs character. 2767 An @code{Ichar} holds a single Emacs character.
2768 2768
2769 Obviously, the equality between characters and bytes is lost in the Mule 2769 Obviously, the equality between characters and bytes is lost in the Mule
2770 world. Characters can be represented by one or more bytes in the 2770 world. Characters can be represented by one or more bytes in the
2771 buffer, and @code{Emchar} is the C type large enough to hold any 2771 buffer, and @code{Ichar} is the C type large enough to hold any
2772 character. 2772 character.
2773 2773
2774 Without Mule support, an @code{Emchar} is equivalent to an 2774 Without Mule support, an @code{Ichar} is equivalent to an
2775 @code{unsigned char}. 2775 @code{unsigned char}.
2776 2776
2777 @item Bufbyte 2777 @item Ibyte
2778 @cindex Bufbyte 2778 @cindex Ibyte
2779 The data representing the text in a buffer or string is logically a set 2779 The data representing the text in a buffer or string is logically a set
2780 of @code{Bufbyte}s. 2780 of @code{Ibyte}s.
2781 2781
2782 XEmacs does not work with the same character formats all the time; when 2782 XEmacs does not work with the same character formats all the time; when
2783 reading characters from the outside, it decodes them to an internal 2783 reading characters from the outside, it decodes them to an internal
2784 format, and likewise encodes them when writing. @code{Bufbyte} (in fact 2784 format, and likewise encodes them when writing. @code{Ibyte} (in fact
2785 @code{unsigned char}) is the basic unit of XEmacs internal buffers and 2785 @code{unsigned char}) is the basic unit of XEmacs internal buffers and
2786 strings format. A @code{Bufbyte *} is the type that points at text 2786 strings format. An @code{Ibyte *} is the type that points at text
2787 encoded in the variable-width internal encoding. 2787 encoded in the variable-width internal encoding.
2788 2788
2789 One character can correspond to one or more @code{Bufbyte}s. In the 2789 One character can correspond to one or more @code{Ibyte}s. In the
2790 current Mule implementation, an ASCII character is represented by the 2790 current Mule implementation, an ASCII character is represented by the
2791 same @code{Bufbyte}, and other characters are represented by a sequence 2791 same @code{Ibyte}, and other characters are represented by a sequence
2792 of two or more @code{Bufbyte}s. 2792 of two or more @code{Ibyte}s.
2793 2793
2794 Without Mule support, there are exactly 256 characters, implicitly 2794 Without Mule support, there are exactly 256 characters, implicitly
2795 Latin-1, and each character is represented using one @code{Bufbyte}, and 2795 Latin-1, and each character is represented using one @code{Ibyte}, and
2796 there is a one-to-one correspondence between @code{Bufbyte}s and 2796 there is a one-to-one correspondence between @code{Ibyte}s and
2797 @code{Emchar}s. 2797 @code{Ichar}s.
2798 2798
2799 @item Bufpos 2799 @item Charxpos
2800 @itemx Charbpos
2800 @itemx Charcount 2801 @itemx Charcount
2801 @cindex Bufpos 2802 @cindex Charxpos
2803 @cindex Charbpos
2802 @cindex Charcount 2804 @cindex Charcount
2803 A @code{Bufpos} represents a character position in a buffer or string. 2805 A @code{Charbpos} represents a character position in a buffer. A
2804 A @code{Charcount} represents a number (count) of characters. 2806 @code{Charcount} represents a number (count) of characters. Logically,
2805 Logically, subtracting two @code{Bufpos} values yields a 2807 subtracting two @code{Charbpos} values yields a @code{Charcount} value.
2806 @code{Charcount} value. Although all of these are @code{typedef}ed to 2808 When representing a character position in a string, we just use
2809 @code{Charcount} directly. The reason for having a separate typedef for
2810 buffer positions is that they are 1-based, whereas string positions are
2811 0-based and hence string counts and positions can be freely intermixed (a
2812 string position is equivalent to the count of characters from the
2813 beginning). When representing a character position that could be either
2814 in a buffer or string (for example, in the extent code), @code{Charxpos}
2815 is used. Although all of these are @code{typedef}ed to
2807 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make 2816 @code{EMACS_INT}, we use them in preference to @code{EMACS_INT} to make
2808 it clear what sort of position is being used. 2817 it clear what sort of position is being used.
2809 2818
2810 @code{Bufpos} and @code{Charcount} values are the only ones that are 2819 @code{Charxpos}, @code{Charbpos} and @code{Charcount} values are the
2811 ever visible to Lisp. 2820 only ones that are ever visible to Lisp.
2812 2821
2813 @item Bytind 2822 @item Bytexpos
2814 @itemx Bytecount 2823 @itemx Bytecount
2815 @cindex Bytind 2824 @cindex Bytebpos
2816 @cindex Bytecount 2825 @cindex Bytecount
2817 A @code{Bytind} represents a byte position in a buffer or string. A 2826 A @code{Bytebpos} represents a byte position in a buffer. A
2818 @code{Bytecount} represents the distance between two positions, in bytes. 2827 @code{Bytecount} represents the distance between two positions, in
2819 The relationship between @code{Bytind} and @code{Bytecount} is the same 2828 bytes. Byte positions in strings use @code{Bytecount}, and for byte
2820 as the relationship between @code{Bufpos} and @code{Charcount}. 2829 positions that can be either in a buffer or string, @code{Bytexpos} is
2830 used. The relationship between @code{Bytexpos}, @code{Bytebpos} and
2831 @code{Bytecount} is the same as the relationship between
2832 @code{Charxpos}, @code{Charbpos} and @code{Charcount}.
2821 2833
2822 @item Extbyte 2834 @item Extbyte
2823 @itemx Extcount
2824 @cindex Extbyte 2835 @cindex Extbyte
2825 @cindex Extcount
2826 When dealing with the outside world, XEmacs works with @code{Extbyte}s, 2836 When dealing with the outside world, XEmacs works with @code{Extbyte}s,
2827 which are equivalent to @code{unsigned char}. Obviously, an 2837 which are equivalent to @code{char}. The distance between two
2828 @code{Extcount} is the distance between two @code{Extbyte}s. Extbytes 2838 @code{Extbyte}s is a @code{Bytecount}, since external text is a
2829 and Extcounts are not all that frequent in XEmacs code. 2839 byte-by-byte encoding. Extbytes occur mainly at the transition point
2840 between internal text and external functions. XEmacs code should not,
2841 if it can possibly avoid it, do any actual manipulation using external
2842 text, since its format is completely unpredictable (it might not even be
2843 ASCII-compatible).
2830 @end table 2844 @end table
2831 2845
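The relationship between buffer positions and counts described in the table can be illustrated with a minimal sketch. The typedefs below are stand-ins written for this note (with @code{EMACS_INT} approximated by @code{long}), and the two helper functions are hypothetical, not part of XEmacs:

```c
#include <assert.h>

/* Stand-in typedefs for illustration only; the real definitions live in
   the XEmacs headers, and EMACS_INT is approximated here by long. */
typedef long EMACS_INT;
typedef EMACS_INT Charbpos;   /* 1-based character position in a buffer */
typedef EMACS_INT Charcount;  /* count of characters; doubles as a 0-based
                                 string position */

/* Hypothetical helper: subtracting two buffer positions yields the
   number of characters between them. */
Charcount
sketch_chars_between (Charbpos from, Charbpos to)
{
  assert (from >= 1 && to >= from);  /* buffer positions start at 1 */
  return to - from;
}

/* Hypothetical helper: the count of characters that precede a 1-based
   buffer position.  A 0-based string position needs no such
   adjustment, which is why strings use Charcount directly. */
Charcount
sketch_chars_before (Charbpos pos)
{
  assert (pos >= 1);
  return pos - 1;
}
```

Because every type here boils down to the same integer, the compiler will not catch a mix-up; the distinct names only document intent.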
2832 @node Working With Character and Byte Positions 2846 @node Working With Character and Byte Positions
2833 @subsection Working With Character and Byte Positions 2847 @subsection Working With Character and Byte Positions
2834 @cindex character and byte positions, working with 2848 @cindex character and byte positions, working with
2841 @file{buffer.h}, and we don't discuss all of them here, but only the 2855 @file{buffer.h}, and we don't discuss all of them here, but only the
2842 most important ones. Examining the existing code is the best way to 2856 most important ones. Examining the existing code is the best way to
2843 learn about them. 2857 learn about them.
2844 2858
2845 @table @code 2859 @table @code
2846 @item MAX_EMCHAR_LEN 2860 @item MAX_ICHAR_LEN
2847 @cindex MAX_EMCHAR_LEN 2861 @cindex MAX_ICHAR_LEN
2848 This preprocessor constant is the maximum number of buffer bytes to 2862 This preprocessor constant is the maximum number of buffer bytes to
2849 represent an Emacs character in the variable width internal encoding. 2863 represent an Emacs character in the variable width internal encoding.
2850 It is useful when allocating temporary strings to keep a known number of 2864 It is useful when allocating temporary strings to keep a known number of
2851 characters. For instance: 2865 characters. For instance:
2852 2866
2855 @{ 2869 @{
2856 Charcount cclen; 2870 Charcount cclen;
2857 ... 2871 ...
2858 @{ 2872 @{
2859 /* Allocate place for @var{cclen} characters. */ 2873 /* Allocate place for @var{cclen} characters. */
2860 Bufbyte *buf = (Bufbyte *)alloca (cclen * MAX_EMCHAR_LEN); 2874 Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN);
2861 ... 2875 ...
2862 @end group 2876 @end group
2863 @end example 2877 @end example
2864 2878
2865 If you followed the previous section, you can guess that, logically, 2879 If you followed the previous section, you can guess that, logically,
2866 multiplying a @code{Charcount} value with @code{MAX_EMCHAR_LEN} produces 2880 multiplying a @code{Charcount} value with @code{MAX_ICHAR_LEN} produces
2867 a @code{Bytecount} value. 2881 a @code{Bytecount} value.
2868 2882
2869 In the current Mule implementation, @code{MAX_EMCHAR_LEN} equals 4. 2883 In the current Mule implementation, @code{MAX_ICHAR_LEN} equals 4.
2870 Without Mule, it is 1. 2884 Without Mule, it is 1.
2871 2885
2872 @item charptr_emchar 2886 @item itext_ichar
2873 @itemx set_charptr_emchar 2887 @itemx set_itext_ichar
2874 @cindex charptr_emchar 2888 @cindex itext_ichar
2875 @cindex set_charptr_emchar 2889 @cindex set_itext_ichar
2876 The @code{charptr_emchar} macro takes a @code{Bufbyte} pointer and 2890 The @code{itext_ichar} macro takes an @code{Ibyte} pointer and
2877 returns the @code{Emchar} stored at that position. If it were a 2891 returns the @code{Ichar} stored at that position. If it were a
2878 function, its prototype would be: 2892 function, its prototype would be:
2879 2893
2880 @example 2894 @example
2881 Emchar charptr_emchar (Bufbyte *p); 2895 Ichar itext_ichar (Ibyte *p);
2882 @end example 2896 @end example
2883 2897
2884 @code{set_charptr_emchar} stores an @code{Emchar} to the specified byte 2898 @code{set_itext_ichar} stores an @code{Ichar} to the specified byte
2885 position. It returns the number of bytes stored: 2899 position. It returns the number of bytes stored:
2886 2900
2887 @example 2901 @example
2888 Bytecount set_charptr_emchar (Bufbyte *p, Emchar c); 2902 Bytecount set_itext_ichar (Ibyte *p, Ichar c);
2889 @end example 2903 @end example
2890 2904
2891 It is important to note that @code{set_charptr_emchar} is safe only for 2905 It is important to note that @code{set_itext_ichar} is safe only for
2892 appending a character at the end of a buffer, not for overwriting a 2906 appending a character at the end of a buffer, not for overwriting a
2893 character in the middle. This is because the width of characters 2907 character in the middle. This is because the width of characters
2894 varies, and @code{set_charptr_emchar} cannot resize the string if it 2908 varies, and @code{set_itext_ichar} cannot resize the string if it
2895 writes, say, a two-byte character where a single-byte character used to 2909 writes, say, a two-byte character where a single-byte character used to
2896 reside. 2910 reside.
2897 2911
2898 A typical use of @code{set_charptr_emchar} can be demonstrated by this 2912 A typical use of @code{set_itext_ichar} can be demonstrated by this
2899 example, which copies characters from buffer @var{buf} to a temporary 2913 example, which copies characters from buffer @var{buf} to a temporary
2900 string of Bufbytes. 2914 string of Ibytes.
2901 2915
2902 @example 2916 @example
2903 @group 2917 @group
2904 @{ 2918 @{
2905 Bufpos pos; 2919 Charbpos pos;
2906 for (pos = beg; pos < end; pos++) 2920 for (pos = beg; pos < end; pos++)
2907 @{ 2921 @{
2908 Emchar c = BUF_FETCH_CHAR (buf, pos); 2922 Ichar c = BUF_FETCH_CHAR (buf, pos);
2909 p += set_charptr_emchar (buf, c); 2923 p += set_itext_ichar (buf, c);
2910 @} 2924 @}
2911 @} 2925 @}
2912 @end group 2926 @end group
2913 @end example 2927 @end example
2914 2928
2915 Note how @code{set_charptr_emchar} is used to store the @code{Emchar} 2929 Note how @code{set_itext_ichar} is used to store the @code{Ichar}
2916 and increment the counter, at the same time. 2930 and increment the counter, at the same time.
2917 2931
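A self-contained sketch of the same append pattern follows. It makes two assumptions. First, the store target is the advancing pointer @code{p} (the loop above names @code{buf} there, which otherwise denotes the source buffer). Second, since the real internal encoding is not reproduced here, UTF-8 is used as a stand-in variable-width encoding. All @code{sketch_} and @code{SKETCH_} names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char Ibyte;   /* stand-in typedefs for illustration */
typedef int Ichar;
typedef ptrdiff_t Bytecount;
typedef ptrdiff_t Charcount;

/* Worst-case bytes per character in the stand-in encoding. */
#define SKETCH_MAX_ICHAR_LEN 4

/* Stand-in for set_itext_ichar: store C at P and return the number of
   bytes stored.  Uses UTF-8, NOT the XEmacs internal encoding. */
Bytecount
sketch_set_itext_ichar (Ibyte *p, Ichar c)
{
  assert (c >= 0 && c <= 0x10FFFF);
  if (c < 0x80)
    {
      p[0] = (Ibyte) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (Ibyte) (0xC0 | (c >> 6));
      p[1] = (Ibyte) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (Ibyte) (0xE0 | (c >> 12));
      p[1] = (Ibyte) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (Ibyte) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (Ibyte) (0xF0 | (c >> 18));
  p[1] = (Ibyte) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (Ibyte) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (Ibyte) (0x80 | (c & 0x3F));
  return 4;
}

/* Append CCLEN characters to DST, which must have room for
   CCLEN * SKETCH_MAX_ICHAR_LEN bytes.  Returns the bytes written. */
Bytecount
sketch_append_chars (Ibyte *dst, const Ichar *chars, Charcount cclen)
{
  Ibyte *p = dst;
  Charcount i;
  for (i = 0; i < cclen; i++)
    p += sketch_set_itext_ichar (p, chars[i]);  /* store and advance */
  return p - dst;
}
```

Sizing the destination as the character count times the worst-case length is what keeps the loop safe without a per-character bounds check.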
2918 @item INC_CHARPTR 2932 @item INC_IBYTEPTR
2919 @itemx DEC_CHARPTR 2933 @itemx DEC_IBYTEPTR
2920 @cindex INC_CHARPTR 2934 @cindex INC_IBYTEPTR
2921 @cindex DEC_CHARPTR 2935 @cindex DEC_IBYTEPTR
2922 These two macros increment and decrement a @code{Bufbyte} pointer, 2936 These two macros increment and decrement an @code{Ibyte} pointer,
2923 respectively. They will adjust the pointer by the appropriate number of 2937 respectively. They will adjust the pointer by the appropriate number of
2924 bytes according to the byte length of the character stored there. Both 2938 bytes according to the byte length of the character stored there. Both
2925 macros assume that the memory address is located at the beginning of a 2939 macros assume that the memory address is located at the beginning of a
2926 valid character. 2940 valid character.
2927 2941
2928 Without Mule support, @code{INC_CHARPTR (p)} and @code{DEC_CHARPTR (p)} 2942 Without Mule support, @code{INC_IBYTEPTR (p)} and @code{DEC_IBYTEPTR (p)}
2929 simply expand to @code{p++} and @code{p--}, respectively. 2943 simply expand to @code{p++} and @code{p--}, respectively.
2930 2944
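To make the pointer adjustment concrete, here is a minimal sketch of character-wise stepping. It again uses UTF-8 as a stand-in encoding, and it is written as functions that return the new pointer rather than as macros that modify their argument. The real macros may be implemented quite differently, and the @code{sketch_} names are hypothetical:

```c
typedef unsigned char Ibyte;   /* stand-in typedef for illustration */

/* In the UTF-8 stand-in, every byte of a character except the first
   has the bit pattern 10xxxxxx. */
int
sketch_is_continuation (Ibyte b)
{
  return (b & 0xC0) == 0x80;
}

/* Step forward over one character.  P must point at the beginning of a
   valid character, and the text must be followed by another character
   or a terminating NUL. */
const Ibyte *
sketch_inc_ibyteptr (const Ibyte *p)
{
  do
    p++;
  while (sketch_is_continuation (*p));
  return p;
}

/* Step backward over one character.  P must point at the beginning of a
   valid character that is not the first one in the text. */
const Ibyte *
sketch_dec_ibyteptr (const Ibyte *p)
{
  do
    p--;
  while (sketch_is_continuation (*p));
  return p;
}
```

Both helpers rely on the same precondition the text states for the macros: the pointer must already sit on a character boundary.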
2931 @item bytecount_to_charcount 2945 @item bytecount_to_charcount
2932 @cindex bytecount_to_charcount 2946 @cindex bytecount_to_charcount
2933 Given a pointer to a text string and a length in bytes, return the 2947 Given a pointer to a text string and a length in bytes, return the
2934 equivalent length in characters. 2948 equivalent length in characters.
2935 2949
2936 @example 2950 @example
2937 Charcount bytecount_to_charcount (Bufbyte *p, Bytecount bc); 2951 Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc);
2938 @end example 2952 @end example
2939 2953
2940 @item charcount_to_bytecount 2954 @item charcount_to_bytecount
2941 @cindex charcount_to_bytecount 2955 @cindex charcount_to_bytecount
2942 Given a pointer to a text string and a length in characters, return the 2956 Given a pointer to a text string and a length in characters, return the
2943 equivalent length in bytes. 2957 equivalent length in bytes.
2944 2958
2945 @example 2959 @example
2946 Bytecount charcount_to_bytecount (Bufbyte *p, Charcount cc); 2960 Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc);
2947 @end example 2961 @end example
2948 2962
2949 @item charptr_n_addr 2963 @item itext_n_addr
2950 @cindex charptr_n_addr 2964 @cindex itext_n_addr
2951 Return a pointer to the beginning of the character offset @var{cc} (in 2965 Return a pointer to the beginning of the character offset @var{cc} (in
2952 characters) from @var{p}. 2966 characters) from @var{p}.
2953 2967
2954 @example 2968 @example
2955 Bufbyte *charptr_n_addr (Bufbyte *p, Charcount cc); 2969 Ibyte *itext_n_addr (Ibyte *p, Charcount cc);
2956 @end example 2970 @end example
2957 @end table 2971 @end table
2958 2972
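The three conversions above can be sketched over a stand-in encoding as follows. UTF-8 is assumed in place of the real internal format, the text is assumed to be NUL-terminated and to contain at least the requested number of characters, and the @code{sketch_} functions are hypothetical illustrations rather than the XEmacs implementations:

```c
#include <stddef.h>

typedef unsigned char Ibyte;   /* stand-in typedefs for illustration */
typedef ptrdiff_t Bytecount;
typedef ptrdiff_t Charcount;

/* Length in characters of the BC bytes starting at P. */
Charcount
sketch_bytecount_to_charcount (const Ibyte *p, Bytecount bc)
{
  Charcount cc = 0;
  Bytecount i;
  for (i = 0; i < bc; i++)
    if ((p[i] & 0xC0) != 0x80)  /* count only bytes that start a character */
      cc++;
  return cc;
}

/* Length in bytes of the first CC characters starting at P. */
Bytecount
sketch_charcount_to_bytecount (const Ibyte *p, Charcount cc)
{
  Bytecount bc = 0;
  while (cc > 0)
    {
      bc++;                             /* leading byte */
      while ((p[bc] & 0xC0) == 0x80)    /* trailing bytes, if any */
        bc++;
      cc--;
    }
  return bc;
}

/* Address of the character CC characters past P. */
const Ibyte *
sketch_itext_n_addr (const Ibyte *p, Charcount cc)
{
  return p + sketch_charcount_to_bytecount (p, cc);
}
```

Each conversion has to scan the text, so the cost grows with the length of the span; this is one reason to keep a value in a single unit (bytes or characters) for as long as possible.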
2959 @node Conversion to and from External Data 2973 @node Conversion to and from External Data
2960 @subsection Conversion to and from External Data 2974 @subsection Conversion to and from External Data
2961 @cindex conversion to and from external data 2975 @cindex conversion to and from external data
2962 @cindex external data, conversion to and from 2976 @cindex external data, conversion to and from
2963 2977
2964 When an external function, such as a C library function, returns a 2978 When an external function, such as a C library function, returns a
2965 @code{char} pointer, you should almost never treat it as @code{Bufbyte}. 2979 @code{char} pointer, you should almost never treat it as @code{Ibyte}.
2966 This is because these returned strings may contain 8bit characters which 2980 This is because these returned strings may contain 8bit characters which
2967 can be misinterpreted by XEmacs, and cause a crash. Likewise, when 2981 can be misinterpreted by XEmacs, and cause a crash. Likewise, when
2968 exporting a piece of internal text to the outside world, you should 2982 exporting a piece of internal text to the outside world, you should
2969 always convert it to an appropriate external encoding, lest the internal 2983 always convert it to an appropriate external encoding, lest the internal
2970 stuff (such as the infamous \201 characters) leak out. 2984 stuff (such as the infamous \201 characters) leak out.
2974 @file{buffer.h}. There used to be a fixed set of external formats 2988 @file{buffer.h}. There used to be a fixed set of external formats
2975 supported by these macros, but now any coding system can be used with 2989 supported by these macros, but now any coding system can be used with
2976 these macros. The coding system alias mechanism is used to create the 2990 these macros. The coding system alias mechanism is used to create the
2977 following logical coding systems, which replace the fixed external 2991 following logical coding systems, which replace the fixed external
2978 formats. The (dontusethis-set-symbol-value-handler) mechanism was 2992 formats. The (dontusethis-set-symbol-value-handler) mechanism was
2979 enhanced to make this possible (more work on that is needed - like 2993 enhanced to make this possible (more work on that is needed).
2980 remove the @code{dontusethis-} prefix). 2994
2995 Example useful coding systems:
2981 2996
2982 @table @code 2997 @table @code
2983 @item Qbinary 2998 @item Qbinary
2984 This is the simplest format and is what we use in the absence of a more 2999 This is the simplest format and is what we use in the absence of a more
2985 appropriate format. This converts according to the @code{binary} coding 3000 appropriate format. This converts according to the @code{binary} coding
2998 @item 3013 @item
2999 On output, characters 0--255 are converted into bytes 0--255 and other 3014 On output, characters 0--255 are converted into bytes 0--255 and other
3000 characters are converted into `~'. 3015 characters are converted into `~'.
3001 @end enumerate 3016 @end enumerate
3002 3017
3003 @item Qfile_name
3004 Format used for filenames. This is user-definable via either the
3005 @code{file-name-coding-system} or @code{pathname-coding-system} (now
3006 obsolete) variables.
3007
3008 @item Qnative 3018 @item Qnative
3009 Format used for the external Unix environment---@code{argv[]}, stuff 3019 Format used for the external Unix environment---@code{argv[]}, stuff
3010 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc. 3020 from @code{getenv()}, stuff from the @file{/etc/passwd} file, etc.
3011 Currently this is the same as Qfile_name. The two should be 3021 This is encoded according to the encoding specified by the current locale.
3012 distinguished for clarity and possible future separation. 3022
3023 @item Qfile_name
3024 Format used for filenames. This is normally the same as @code{Qnative},
3025 but the two should be distinguished for clarity and possible future
3026 separation -- and also because @code{Qfile_name} can be changed using either
3027 the @code{file-name-coding-system} or @code{pathname-coding-system} (now
3028 obsolete) variables.
3013 3029
3014 @item Qctext 3030 @item Qctext
3015 Compound--text format. This is the standard X11 format used for data 3031 Compound-text format. This is the standard X11 format used for data
3016 stored in properties, selections, and the like. This is an 8-bit 3032 stored in properties, selections, and the like. This is an 8-bit
3017 no-lock-shift ISO2022 coding system. This is a real coding system, 3033 no-lock-shift ISO2022 coding system. This is a real coding system,
3018 unlike Qfile_name, which is user-definable. 3034 unlike @code{Qfile_name}, which is user-definable.
3035
3036 @item Qmswindows_tstr
3037 Used for external data in all MS Windows functions that are declared to
3038 accept data of type @code{LPTSTR} or @code{LPCSTR}. This maps to either
3039 @code{Qmswindows_multibyte} (a locale-specific encoding, same as
3040 @code{Qnative}) or @code{Qmswindows_unicode}, depending on whether
3041 XEmacs is being run under Windows 9X or Windows NT/2000/XP.
3019 @end table 3042 @end table
3020 3043
3021 There are two fundamental macros to convert between external and 3044 There are two fundamental macros to convert between external and
3022 internal format. 3045 internal format, as well as various convenience macros to simplify the
3046 most common operations.
3023 3047
3024 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and 3048 @code{TO_INTERNAL_FORMAT} converts external data to internal format, and
3025 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments 3049 @code{TO_EXTERNAL_FORMAT} converts the other way around. The arguments
3026 each of these receives are a source type, a source, a sink type, a sink, 3050 each of these receives are a source type, a source, a sink type, a sink,
3027 and a coding system (or a symbol naming a coding system). 3051 and a coding system (or a symbol naming a coding system).
3065 @item @code{C_STRING_ALLOCA, ptr,} 3089 @item @code{C_STRING_ALLOCA, ptr,}
3066 equivalent to @code{ALLOCA (ptr, len_ignored)} on output. 3090 equivalent to @code{ALLOCA (ptr, len_ignored)} on output.
3067 @item @code{C_STRING_MALLOC, ptr,} 3091 @item @code{C_STRING_MALLOC, ptr,}
3068 equivalent to @code{MALLOC (ptr, len_ignored)} on output 3092 equivalent to @code{MALLOC (ptr, len_ignored)} on output
3069 @item @code{C_STRING, ptr,} 3093 @item @code{C_STRING, ptr,}
3070 equivalent to @code{DATA, (ptr, strlen (ptr) + 1)} on input 3094 equivalent to @code{DATA, (ptr, strlen/wcslen (ptr))} on input
3071 @item @code{LISP_STRING, string,} 3095 @item @code{LISP_STRING, string,}
3072 input or output is a Lisp_Object of type string 3096 input or output is a Lisp_Object of type string
3073 @item @code{LISP_BUFFER, buffer,} 3097 @item @code{LISP_BUFFER, buffer,}
3074 output is written to @code{(point)} in lisp buffer @var{buffer} 3098 output is written to @code{(point)} in lisp buffer @var{buffer}
3075 @item @code{LISP_LSTREAM, lstream,} 3099 @item @code{LISP_LSTREAM, lstream,}
3076 input or output is a Lisp_Object of type lstream 3100 input or output is a Lisp_Object of type lstream
3077 @item @code{LISP_OPAQUE, object,} 3101 @item @code{LISP_OPAQUE, object,}
3078 input or output is a Lisp_Object of type opaque 3102 input or output is a Lisp_Object of type opaque
3079 @end table 3103 @end table
3080 3104
3081 Often, the data is being converted to a '\0'-byte-terminated string, 3105 A source type of @code{C_STRING} or a sink type of
3082 which is the format required by many external system C APIs. For these 3106 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate where
3083 purposes, a source type of @code{C_STRING} or a sink type of 3107 the external API is not '\0'-byte-clean -- i.e. it expects strings to be
3084 @code{C_STRING_ALLOCA} or @code{C_STRING_MALLOC} is appropriate. 3108 terminated with a null byte. For external API's that are in fact
3085 Otherwise, we should try to keep XEmacs '\0'-byte-clean, which means 3109 '\0'-byte-clean, we should of course not use these.
3086 using (ptr, len) pairs.
3087 3110
3088 The sinks to be specified must be lvalues, unless they are the lisp 3111 The sinks to be specified must be lvalues, unless they are the lisp
3089 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}. 3112 object types @code{LISP_LSTREAM} or @code{LISP_BUFFER}.
3113
3114 There is no problem using the same lvalue for source and sink.
3115
3116 Garbage collection is inhibited during these conversion operations, so
3117 it is OK to pass in data from Lisp strings using @code{XSTRING_DATA}.
3090 3118
3091 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the 3119 For the sink types @code{ALLOCA} and @code{C_STRING_ALLOCA}, the
3092 resulting text is stored in a stack-allocated buffer, which is 3120 resulting text is stored in a stack-allocated buffer, which is
3093 automatically freed on returning from the function. However, the sink 3121 automatically freed on returning from the function. However, the sink
3094 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed 3122 types @code{MALLOC} and @code{C_STRING_MALLOC} return @code{xmalloc()}ed
3097 3125
3098 Note that it doesn't make sense for @code{LISP_STRING} to be a source 3126 Note that it doesn't make sense for @code{LISP_STRING} to be a source
3099 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}. 3127 for @code{TO_INTERNAL_FORMAT} or a sink for @code{TO_EXTERNAL_FORMAT}.
3100 You'll get an assertion failure if you try. 3128 You'll get an assertion failure if you try.
3101 3129
3130 99% of conversions involve raw data or Lisp strings as both source and
3131 sink, and usually data is output as @code{alloca()}, or sometimes
3132 @code{xmalloc()}. For this reason, convenience macros are defined for
3133 many types of conversions involving raw data and/or Lisp strings,
3134 especially when the output is an @code{alloca()}ed string. (When the
3135 destination is a Lisp string, there are other functions that should be
3136 used instead -- @code{build_ext_string()} and @code{make_ext_string()},
3137 for example.) The convenience macros are of two types -- the older kind
3138 that store the result into a specified variable, and the newer kind that
3139 return the result. The newer kind of macros don't exist when the output
3140 is sized data, because that would have two return values. NOTE: All
3141 convenience macros are ultimately defined in terms of
3142 @code{TO_EXTERNAL_FORMAT} and @code{TO_INTERNAL_FORMAT}. Thus, any
3143 comments above about the workings of these macros also apply to all
3144 convenience macros.
3145
3146 A typical old-style convenience macro is
3147
3148 @example
3149 C_STRING_TO_EXTERNAL (in, out, codesys);
3150 @end example
3151
3152 This is equivalent to
3153
3154 @example
3155 TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys);
3156 @end example
3157
3158 but is easier to write and somewhat clearer, since it clearly identifies
3159 the arguments without the clutter of having the preprocessor types mixed
3160 in.
3161
3162 The new-style equivalent is @code{NEW_C_STRING_TO_EXTERNAL (src,
3163 codesys)}, which @emph{returns} the converted data (still in
3164 @code{alloca()} space). This is far more convenient for most
3165 operations.
3102 3166
3103 @node General Guidelines for Writing Mule-Aware Code 3167 @node General Guidelines for Writing Mule-Aware Code
3104 @subsection General Guidelines for Writing Mule-Aware Code 3168 @subsection General Guidelines for Writing Mule-Aware Code
3105 @cindex writing Mule-aware code, general guidelines for 3169 @cindex writing Mule-aware code, general guidelines for
3106 @cindex Mule-aware code, general guidelines for writing 3170 @cindex Mule-aware code, general guidelines for writing
3111 3175
3112 @table @emph 3176 @table @emph
3113 @item Never use @code{char} and @code{char *}. 3177 @item Never use @code{char} and @code{char *}.
3114 In XEmacs, the use of @code{char} and @code{char *} is almost always a 3178 In XEmacs, the use of @code{char} and @code{char *} is almost always a
3115 mistake. If you want to manipulate an Emacs character from ``C'', use 3179 mistake. If you want to manipulate an Emacs character from ``C'', use
3116 @code{Emchar}. If you want to examine a specific octet in the internal 3180 @code{Ichar}. If you want to examine a specific octet in the internal
3117 format, use @code{Bufbyte}. If you want a Lisp-visible character, use a 3181 format, use @code{Ibyte}. If you want a Lisp-visible character, use a
3118 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move 3182 @code{Lisp_Object} and @code{make_char}. If you want a pointer to move
3119 through the internal text, use @code{Bufbyte *}. Also note that you 3183 through the internal text, use @code{Ibyte *}. Also note that you
3120 almost certainly do not need @code{Emchar *}. 3184 almost certainly do not need @code{Ichar *}. Other typedefs to clarify
3121 3185 the use of @code{char} are @code{Char_ASCII}, @code{Char_Binary},
3122 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, and @code{Bufpos}. 3186 @code{UChar_Binary}, and @code{CIbyte}.
3187
3188 @item Be careful not to confuse @code{Charcount}, @code{Bytecount}, @code{Charbpos} and @code{Bytebpos}.
3123 The whole point of using different types is to avoid confusion about the 3189 The whole point of using different types is to avoid confusion about the
3124 use of certain variables. Lest this effect be nullified, you need to be 3190 use of certain variables. Lest this effect be nullified, you need to be
3125 careful about using the right types. 3191 careful about using the right types.
3126 3192
3127 @item Always convert external data 3193 @item Always convert external data
3128 It is extremely important to always convert external data, because 3194 It is extremely important to always convert external data, because
3129 XEmacs can crash if unexpected 8bit sequences are copied to its internal 3195 XEmacs can crash if unexpected 8-bit sequences are copied to its internal
3130 buffers literally. 3196 buffers literally.
3131 3197
3132 This means that when a system function, such as @code{readdir}, returns 3198 This means that when a system function, such as @code{readdir}, returns
3133 a string, you may need to convert it using one of the conversion macros 3199 a string, you may need to convert it using one of the conversion macros
3134 described in the previous chapter, before passing it further to Lisp. 3200 described in the previous chapter, before passing it further to Lisp.
3135 3201
3136 Actually, most of the basic system functions that accept '\0'-terminated 3202 Actually, most of the basic system functions that accept '\0'-terminated
3137 string arguments, like @code{stat()} and @code{open()}, have been 3203 string arguments, like @code{stat()} and @code{open()}, have
3138 @strong{encapsulated} so that they are they @code{always} do internal to 3204 @strong{encapsulated} equivalents that do the internal to external
3139 external conversion themselves. This means you must pass internally 3205 conversion themselves. The encapsulated equivalents have a @code{qxe_}
3140 encoded data, typically the @code{XSTRING_DATA} of a Lisp_String to 3206 prefix and have string arguments of type @code{Ibyte *}, and you can
3141 these functions. This is actually a design bug, since it unexpectedly 3207 pass internally encoded data to them, often from a Lisp string using
3142 changes the semantics of the system functions. A better design would be 3208 @code{XSTRING_DATA}. (A better design might be to provide versions that
3143 to provide separate versions of these system functions that accepted 3209 accept Lisp strings directly.)
3144 Lisp_Objects which were lisp strings in place of their current
3145 @code{char *} arguments.
3146
3147 @example
3148 int stat_lisp (Lisp_Object path, struct stat *buf); /* Implement me */
3149 @end example
3150 3210
3151 Also note that many internal functions, such as @code{make_string}, 3211 Also note that many internal functions, such as @code{make_string},
3152 accept Bufbytes, which removes the need for them to convert the data 3212 accept Ibytes, which removes the need for them to convert the data they
3153 they receive. This increases efficiency because that way external data 3213 receive. This increases efficiency because that way external data needs
3154 needs to be decoded only once, when it is read. After that, it is 3214 to be decoded only once, when it is read. After that, it is passed
3155 passed around in internal format. 3215 around in internal format.
3216
3217 @item Do all work in internal format
3218 External-formatted data is completely unpredictable in its format. It
3219 may be Unicode (non-ASCII compatible); it may be a modal encoding, in
3220 which case some occurrences of (e.g.) the slash character may be part of
3221 two-byte Asian-language characters, and a naive attempt to split apart a
3222 pathname by slashes will fail; etc. Internal-format text should be
3223 converted to external format only at the point where an external API is
3224 actually called, and the first thing done after receiving
3225 external-format text from an external API should be to convert it to
3226 internal text.
3156 @end table 3227 @end table
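
To make the slash pitfall above concrete, here is a minimal standalone sketch. It is not XEmacs code: the function names are hypothetical, and UTF-16LE is assumed as one example of a non-ASCII-compatible external encoding. The sample text is CYRILLIC CAPITAL LETTER YA (U+042F) followed by a real slash; the low byte of U+042F happens to be 0x2F, so a byte-wise scan sees two slashes where there is only one.

```c
#include <assert.h>
#include <stddef.h>

/* Naive approach: treat the text as bytes and count every 0x2F byte. */
size_t
count_slash_bytes (const unsigned char *buf, size_t len)
{
  size_t i, count = 0;
  for (i = 0; i < len; i++)
    if (buf[i] == 0x2F)
      count++;
  return count;
}

/* Encoding-aware approach for UTF-16LE: compare whole 16-bit code units. */
size_t
count_slash_utf16le (const unsigned char *buf, size_t len)
{
  size_t i, count = 0;
  assert (len % 2 == 0);
  for (i = 0; i + 1 < len; i += 2)
    {
      unsigned int unit = buf[i] | ((unsigned int) buf[i + 1] << 8);
      if (unit == 0x002F)
        count++;
    }
  return count;
}

/* U+042F (bytes 2F 04) followed by U+002F (bytes 2F 00), in UTF-16LE. */
static const unsigned char demo_text[] = { 0x2F, 0x04, 0x2F, 0x00 };

size_t
demo_naive_count (void)
{
  return count_slash_bytes (demo_text, sizeof demo_text);
}

size_t
demo_aware_count (void)
{
  return count_slash_utf16le (demo_text, sizeof demo_text);
}
```

The naive count is wrong because it never decodes the text. Converting to internal format first and then searching avoids having to write an encoding-aware scanner for every possible external encoding.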
3157 3228
3158 @node An Example of Mule-Aware Code 3229 @node An Example of Mule-Aware Code
3159 @subsection An Example of Mule-Aware Code 3230 @subsection An Example of Mule-Aware Code
3160 @cindex code, an example of Mule-aware 3231 @cindex code, an example of Mule-aware
3169 DEFUN ("string", Fstring, 0, MANY, 0, /* 3240 DEFUN ("string", Fstring, 0, MANY, 0, /*
3170 Concatenate all the argument characters and make the result a string. 3241 Concatenate all the argument characters and make the result a string.
3171 */ 3242 */
3172 (int nargs, Lisp_Object *args)) 3243 (int nargs, Lisp_Object *args))
3173 @{ 3244 @{
3174 Bufbyte *storage = alloca_array (Bufbyte, nargs * MAX_EMCHAR_LEN); 3245 Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN);
3175 Bufbyte *p = storage; 3246 Ibyte *p = storage;
3176 3247
3177 for (; nargs; nargs--, args++) 3248 for (; nargs; nargs--, args++)
3178 @{ 3249 @{
3179 Lisp_Object lisp_char = *args; 3250 Lisp_Object lisp_char = *args;
3180 CHECK_CHAR_COERCE_INT (lisp_char); 3251 CHECK_CHAR_COERCE_INT (lisp_char);
3181 p += set_charptr_emchar (p, XCHAR (lisp_char)); 3252 p += set_itext_ichar (p, XCHAR (lisp_char));
3182 @} 3253 @}
3183 return make_string (storage, p - storage); 3254 return make_string (storage, p - storage);
3184 @} 3255 @}
3185 @end group 3256 @end group
3186 @end example 3257 @end example
3187 3258
3188 Now we can analyze the source line by line. 3259 Now we can analyze the source line by line.
3189 3260
3190 Obviously, the string will contain as many characters as there are arguments to the 3261 Obviously, the string will contain as many characters as there are arguments to the
3191 function. This is why we allocate @code{MAX_EMCHAR_LEN} * @var{nargs} 3262 function. This is why we allocate @code{MAX_ICHAR_LEN} * @var{nargs}
3192 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs} 3263 bytes on the stack, i.e. the worst-case number of bytes for @var{nargs}
3193 @code{Emchar}s to fit in the string. 3264 @code{Ichar}s to fit in the string.
3194 3265
3195 Then, the loop checks that each element is a character, converting 3266 Then, the loop checks that each element is a character, converting
3196 integers in the process. Like many other functions in XEmacs, this 3267 integers in the process. Like many other functions in XEmacs, this
3197 function silently accepts integers where characters are expected, for 3268 function silently accepts integers where characters are expected, for
3198 historical and compatibility reasons. Unless you know what you are 3269 historical and compatibility reasons. Unless you know what you are
3199 doing, @code{CHECK_CHAR} will also suffice. @code{XCHAR (lisp_char)} 3270 doing, @code{CHECK_CHAR} will also suffice. @code{XCHAR (lisp_char)}
3200 extracts the @code{Emchar} from the @code{Lisp_Object}, and 3271 extracts the @code{Ichar} from the @code{Lisp_Object}, and
3201 @code{set_charptr_emchar} stores it to storage, increasing @code{p} in 3272 @code{set_itext_ichar} stores it to storage, increasing @code{p} in
3202 the process. 3273 the process.
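
The same pattern (reserve the worst-case number of bytes, then let each store operation report how far to advance) can be tried outside XEmacs. The following is only a sketch under stated assumptions: UTF-8 stands in for the XEmacs internal format, @code{MAX_ENCODED_LEN} and @code{store_char} are hypothetical stand-ins for the maximum character length constant and the character-storing function in the listing above, and @code{malloc} replaces @code{alloca} for portability.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: UTF-8 is the stand-in encoding, so 4 bytes is the worst case. */
enum { MAX_ENCODED_LEN = 4 };

/* Store character C at P and return the number of bytes written. */
size_t
store_char (unsigned char *p, unsigned long c)
{
  assert (c <= 0x10FFFF);
  if (c < 0x80)
    {
      p[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (unsigned char) (0xC0 | (c >> 6));
      p[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (unsigned char) (0xE0 | (c >> 12));
      p[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (unsigned char) (0xF0 | (c >> 18));
  p[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Encode NCHARS characters into STORAGE, which must hold at least
   NCHARS * MAX_ENCODED_LEN bytes.  Return the number of bytes used. */
size_t
encode_chars (const unsigned long *chars, size_t nchars,
              unsigned char *storage)
{
  unsigned char *p = storage;
  size_t i;
  for (i = 0; i < nchars; i++)
    p += store_char (p, chars[i]);
  return (size_t) (p - storage);
}

/* Encode a 1-byte, a 2-byte and a 3-byte character. */
size_t
demo_encoded_length (void)
{
  static const unsigned long chars[] = { 0x61, 0xE9, 0x20AC };
  size_t nchars = sizeof chars / sizeof chars[0];
  unsigned char *storage = malloc (nchars * MAX_ENCODED_LEN);
  size_t len;
  assert (storage != NULL);
  len = encode_chars (chars, nchars, storage);
  free (storage);
  return len;
}
```

Here the worst-case buffer is 12 bytes while only 6 are used; the difference @code{p - storage} is what tells the caller the real length.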
3203 3274
3204 Other instructive examples of correct coding under Mule can be found all 3275 Other instructive examples of correct coding under Mule can be found all
3205 over the XEmacs code. For starters, I recommend 3276 over the XEmacs code. For starters, I recommend
3206 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have 3277 @code{Fnormalize_menu_item_name} in @file{menubar.c}. After you have
3207 understood this section of the manual and studied the examples, you can 3278 understood this section of the manual and studied the examples, you can
3208 proceed to write new Mule-aware code. 3279 proceed to write new Mule-aware code.
3280
3281 @node Mule-izing Code
3282 @subsection Mule-izing Code
3283
3284 A lot of code is written without Mule in mind, and needs to be made
3285 Mule-correct or ``Mule-ized''. There is really no substitute for
3286 line-by-line analysis when doing this, but the following checklist can
3287 help:
3288
3289 @itemize @bullet
3290 @item
3291 Check all uses of @code{XSTRING_DATA}.
3292 @item
3293 Check all uses of @code{build_string} and @code{make_string}.
3294 @item
3295 Check all uses of @code{tolower} and @code{toupper}.
3296 @item
3297 Check object print methods.
3298 @item
3299 Check for use of functions such as @code{write_c_string},
3300 @code{write_fmt_string}, @code{stderr_out}, @code{stdout_out}.
3301 @item
3302 Check all occurrences of @code{char} and correct to one of the other
3303 typedefs described above.
3304 @item
3305 Check all existing uses of @code{TO_EXTERNAL_FORMAT},
3306 @code{TO_INTERNAL_FORMAT}, and any convenience macros (grep for
3307 @samp{EXTERNAL_TO}, @samp{TO_EXTERNAL}, and @samp{TO_SIZED_EXTERNAL}).
3308 @item
3309 In Windows code, string literals may need to be encapsulated with @code{XETEXT}.
3310 @end itemize
3209 3311
3210 @node Techniques for XEmacs Developers 3312 @node Techniques for XEmacs Developers
3211 @section Techniques for XEmacs Developers 3313 @section Techniques for XEmacs Developers
3212 @cindex techniques for XEmacs developers 3314 @cindex techniques for XEmacs developers
3213 @cindex developers, techniques for XEmacs 3315 @cindex developers, techniques for XEmacs
8009 @menu 8111 @menu
8010 * Introduction to Buffers:: A buffer holds a block of text such as a file. 8112 * Introduction to Buffers:: A buffer holds a block of text such as a file.
8011 * The Text in a Buffer:: Representation of the text in a buffer. 8113 * The Text in a Buffer:: Representation of the text in a buffer.
8012 * Buffer Lists:: Keeping track of all buffers. 8114 * Buffer Lists:: Keeping track of all buffers.
8013 * Markers and Extents:: Tagging locations within a buffer. 8115 * Markers and Extents:: Tagging locations within a buffer.
8014 * Bufbytes and Emchars:: Representation of individual characters. 8116 * Ibytes and Ichars:: Representation of individual characters.
8015 * The Buffer Object:: The Lisp object corresponding to a buffer. 8117 * The Buffer Object:: The Lisp object corresponding to a buffer.
8016 @end menu 8118 @end menu
8017 8119
8018 @node Introduction to Buffers 8120 @node Introduction to Buffers
8019 @section Introduction to Buffers 8121 @section Introduction to Buffers
8085 8187
8086 For now, we can view a character as some non-negative integer that 8188 For now, we can view a character as some non-negative integer that
8087 has some shape that defines how it typically appears (e.g. as an 8189 has some shape that defines how it typically appears (e.g. as an
8088 uppercase A). (The exact way in which a character appears depends on the 8190 uppercase A). (The exact way in which a character appears depends on the
8089 font used to display the character.) The internal type of characters in 8191 font used to display the character.) The internal type of characters in
8090 the C code is an @code{Emchar}; this is just an @code{int}, but using a 8192 the C code is an @code{Ichar}; this is just an @code{int}, but using a
8091 symbolic type makes the code clearer. 8193 symbolic type makes the code clearer.
8092 8194
8093 Between every character in a buffer is a @dfn{buffer position} or 8195 Between every character in a buffer is a @dfn{buffer position} or
8094 @dfn{character position}. We can speak of the character before or after 8196 @dfn{character position}. We can speak of the character before or after
8095 a particular buffer position, and when you insert a character at a 8197 a particular buffer position, and when you insert a character at a
8102 Buffer positions are numbered starting at 1. This means that 8204 Buffer positions are numbered starting at 1. This means that
8103 position 1 is before the first character, and position 0 is not 8205 position 1 is before the first character, and position 0 is not
8104 valid. If there are N characters in a buffer, then buffer 8206 valid. If there are N characters in a buffer, then buffer
8105 position N+1 is after the last one, and position N+2 is not valid. 8207 position N+1 is after the last one, and position N+2 is not valid.
8106 8208
8107 The internal makeup of the Emchar integer varies depending on whether 8209 The internal makeup of the Ichar integer varies depending on whether
8108 we have compiled with MULE support. If not, the Emchar integer is an 8210 we have compiled with MULE support. If not, the Ichar integer is an
8109 8-bit integer with possible values from 0 - 255. 0 - 127 are the 8211 8-bit integer with possible values from 0 - 255. 0 - 127 are the
8110 standard ASCII characters, while 128 - 255 are the characters from the 8212 standard ASCII characters, while 128 - 255 are the characters from the
8111 ISO-8859-1 character set. If we have compiled with MULE support, an 8213 ISO-8859-1 character set. If we have compiled with MULE support, an
8112 Emchar is a 19-bit integer, with the various bits having meanings 8214 Ichar is a 19-bit integer, with the various bits having meanings
8113 according to a complex scheme that will be detailed later. The 8215 according to a complex scheme that will be detailed later. The
8114 characters numbered 0 - 255 still have the same meanings as for the 8216 characters numbered 0 - 255 still have the same meanings as for the
8115 non-MULE case, though. 8217 non-MULE case, though.
8116 8218
8117 Internally, the text in a buffer is represented in a fairly simple 8219 Internally, the text in a buffer is represented in a fairly simple
8146 the situation is different. In this case, the space @emph{will} be 8248 the situation is different. In this case, the space @emph{will} be
8147 released back to the operating system. However, this tends to result in a 8249 released back to the operating system. However, this tends to result in a
8148 noticeable speed penalty.) 8250 noticeable speed penalty.)
8149 8251
8150 Astute readers may notice that the text in a buffer is represented as 8252 Astute readers may notice that the text in a buffer is represented as
8151 an array of @emph{bytes}, while (at least in the MULE case) an Emchar is 8253 an array of @emph{bytes}, while (at least in the MULE case) an Ichar is
8152 a 19-bit integer, which clearly cannot fit in a byte. This means (of 8254 a 19-bit integer, which clearly cannot fit in a byte. This means (of
8153 course) that the text in a buffer uses a different representation from 8255 course) that the text in a buffer uses a different representation from
8154 an Emchar: specifically, the 19-bit Emchar becomes a series of one to 8256 an Ichar: specifically, the 19-bit Ichar becomes a series of one to
8155 four bytes. The conversion between these two representations is complex 8257 four bytes. The conversion between these two representations is complex
8156 and will be described later. 8258 and will be described later.
8157 8259
8158 In the non-MULE case, everything is very simple: An Emchar 8260 In the non-MULE case, everything is very simple: An Ichar
8159 is an 8-bit value, which fits neatly into one byte. 8261 is an 8-bit value, which fits neatly into one byte.
8160 8262
8161 If we are given a buffer position and want to retrieve the 8263 If we are given a buffer position and want to retrieve the
8162 character at that position, we need to follow these steps: 8264 character at that position, we need to follow these steps:
8163 8265
8178 position that is @dfn{at} the gap, we always use the memory position at 8280 position that is @dfn{at} the gap, we always use the memory position at
8179 the @emph{beginning}, not at the end, of the gap. 8281 the @emph{beginning}, not at the end, of the gap.
8180 @item 8282 @item
8181 Fetch the appropriate bytes at the determined memory position. 8283 Fetch the appropriate bytes at the determined memory position.
8182 @item 8284 @item
8183 Convert these bytes into an Emchar. 8285 Convert these bytes into an Ichar.
8184 @end enumerate 8286 @end enumerate
8185 8287
8186 In the non-Mule case, (3) and (4) boil down to a simple one-byte 8288 In the non-Mule case, (3) and (4) boil down to a simple one-byte
8187 memory access. 8289 memory access.
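
The lookup steps above can be sketched for the non-Mule case, where every character is one byte and a buffer position is therefore also a byte index. This is a simplified illustration, not the code in XEmacs: the structure and function names are hypothetical, and positions and memory indices are 1-based as in the text.

```c
#include <assert.h>

struct toy_buffer
{
  const unsigned char *mem;  /* text storage with a gap inside it */
  int gap_start;             /* 1-based memory index of the first gap byte */
  int gap_size;              /* number of unused bytes in the gap */
  int text_len;              /* number of characters of real text */
};

/* Memory index for position POS.  A position exactly at the gap maps to
   the beginning of the gap, not to its end. */
int
pos_to_mem (const struct toy_buffer *b, int pos)
{
  assert (pos >= 1 && pos <= b->text_len + 1);
  return pos > b->gap_start ? pos + b->gap_size : pos;
}

/* Fetch the character after position POS.  Here a position at the gap
   must skip over the gap, because the character lives beyond it. */
int
fetch_char (const struct toy_buffer *b, int pos)
{
  int mem;
  assert (pos >= 1 && pos <= b->text_len);
  mem = pos >= b->gap_start ? pos + b->gap_size : pos;
  return b->mem[mem - 1];
}

/* Demo storage: "ab", a 3-byte gap, then "cd". */
static const unsigned char demo_mem[] = { 'a', 'b', 0, 0, 0, 'c', 'd' };

int
demo_fetch (int pos)
{
  struct toy_buffer b = { demo_mem, 3, 3, 4 };
  return fetch_char (&b, pos);
}

int
demo_pos_to_mem (int pos)
{
  struct toy_buffer b = { demo_mem, 3, 3, 4 };
  return pos_to_mem (&b, pos);
}
```

In the demo, position 3 sits at the gap: its memory index is 3 (the beginning of the gap), but the character after it, @samp{c}, is fetched from memory index 6.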
8188 8290
8189 Note that we have defined three types of positions in a buffer: 8291 Note that we have defined three types of positions in a buffer:
8190 8292
8191 @enumerate 8293 @enumerate
8192 @item 8294 @item
8193 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Bufpos} 8295 @dfn{buffer positions} or @dfn{character positions}, typedef @code{Charbpos}
8194 @item 8296 @item
8195 @dfn{byte indices}, typedef @code{Bytind} 8297 @dfn{byte indices}, typedef @code{Bytebpos}
8196 @item 8298 @item
8197 @dfn{memory indices}, typedef @code{Memind} 8299 @dfn{memory indices}, typedef @code{Membpos}
8198 @end enumerate 8300 @end enumerate
8199 8301
8200 All three typedefs are just @code{int}s, but defining them this way makes 8302 All three typedefs are just @code{int}s, but defining them this way makes
8201 things a lot clearer. 8303 things a lot clearer.
8202 8304
8203 Most code works with buffer positions. In particular, all Lisp code 8305 Most code works with buffer positions. In particular, all Lisp code
8204 that refers to text in a buffer uses buffer positions. Lisp code does 8306 that refers to text in a buffer uses buffer positions. Lisp code does
8205 not know that byte indices or memory indices exist. 8307 not know that byte indices or memory indices exist.
8206 8308
8207 Finally, we have a typedef for the bytes in a buffer. This is a 8309 Finally, we have a typedef for the bytes in a buffer. This is a
8208 @code{Bufbyte}, which is an unsigned char. Referring to them as 8310 @code{Ibyte}, which is an unsigned char. Referring to them as
8209 Bufbytes underscores the fact that we are working with a string of bytes 8311 Ibytes underscores the fact that we are working with a string of bytes
8210 in the internal Emacs buffer representation rather than in one of a 8312 in the internal Emacs buffer representation rather than in one of a
8211 number of possible alternative representations (e.g. EUC-encoded text, 8313 number of possible alternative representations (e.g. EUC-encoded text,
8212 etc.). 8314 etc.).
8213 8315
8214 @node Buffer Lists 8316 @node Buffer Lists
8274 8376
8275 The important thing here is that markers and extents simply contain 8377 The important thing here is that markers and extents simply contain
8276 buffer positions in them as integers, and every time text is inserted or 8378 buffer positions in them as integers, and every time text is inserted or
8277 deleted, these positions must be updated. In order to minimize the 8379 deleted, these positions must be updated. In order to minimize the
8278 amount of shuffling that needs to be done, the positions in markers and 8380 amount of shuffling that needs to be done, the positions in markers and
8279 extents (there's one per marker, two per extent) are stored in Meminds. 8381 extents (there's one per marker, two per extent) are stored in Membpos's.
8280 This means that they only need to be moved when the text is physically 8382 This means that they only need to be moved when the text is physically
8281 moved in memory; since the gap structure tries to minimize this, it also 8383 moved in memory; since the gap structure tries to minimize this, it also
8282 minimizes the number of marker and extent indices that need to be 8384 minimizes the number of marker and extent indices that need to be
8283 adjusted. Look in @file{insdel.c} for the details of how this works. 8385 adjusted. Look in @file{insdel.c} for the details of how this works.
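
A small sketch may help show why storing memory indices saves work. It is an illustration under simplifying assumptions (one byte per character, 1-based indices, hypothetical names), not the logic in @file{insdel.c}. Inserting a byte at the gap changes only the gap bookkeeping, yet a marker stored as a memory index after the gap automatically reports a buffer position one higher.

```c
#include <assert.h>

struct toy_gap
{
  unsigned char mem[7];  /* text storage with a gap inside it */
  int gap_start;         /* 1-based memory index of the first gap byte */
  int gap_size;          /* number of unused bytes in the gap */
};

/* Convert a stored memory index back to a buffer position. */
int
mem_to_pos (const struct toy_gap *g, int mem)
{
  return mem > g->gap_start ? mem - g->gap_size : mem;
}

/* Insert one byte at the gap.  No marker is touched. */
void
insert_at_gap (struct toy_gap *g, unsigned char byte)
{
  assert (g->gap_size > 0);
  g->mem[g->gap_start - 1] = byte;
  g->gap_start++;
  g->gap_size--;
}

/* Start from "ab", a 3-byte gap, then "cd"; insert 'X' at the gap and
   report the buffer position of a marker stored as MARKER_MEM. */
int
demo_marker_pos_after_insert (int marker_mem)
{
  struct toy_gap g = { { 'a', 'b', 0, 0, 0, 'c', 'd' }, 3, 3 };
  insert_at_gap (&g, 'X');
  return mem_to_pos (&g, marker_mem);
}
```

Before the insertion, memory index 7 (just before @samp{d}) corresponds to position 4; afterwards the same stored value corresponds to position 5, with no update to the marker itself. Markers only need rewriting when the gap is moved across them.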
8284 8386
8288 is no way to determine what markers are in a buffer if you are just 8390 is no way to determine what markers are in a buffer if you are just
8289 given the buffer. Extents remain in a buffer until they are detached 8391 given the buffer. Extents remain in a buffer until they are detached
8290 (which could happen as a result of text being deleted) or the buffer is 8392 (which could happen as a result of text being deleted) or the buffer is
8291 deleted, and primitives do exist to enumerate the extents in a buffer. 8393 deleted, and primitives do exist to enumerate the extents in a buffer.
8292 8394
8293 @node Bufbytes and Emchars 8395 @node Ibytes and Ichars
8294 @section Bufbytes and Emchars 8396 @section Ibytes and Ichars
8295 @cindex Bufbytes and Emchars 8397 @cindex Ibytes and Ichars
8296 @cindex Emchars, Bufbytes and 8398 @cindex Ichars, Ibytes and
8297 8399
8298 Not yet documented. 8400 Not yet documented.
8299 8401
8300 @node The Buffer Object 8402 @node The Buffer Object
8301 @section The Buffer Object 8403 @section The Buffer Object
8402 @cindex character sets and encodings, Mule 8504 @cindex character sets and encodings, Mule
8403 @cindex encodings, Mule character sets and 8505 @cindex encodings, Mule character sets and
8404 8506
8405 Recall that there are two primary ways that text is represented in 8507 Recall that there are two primary ways that text is represented in
8406 XEmacs. The @dfn{buffer} representation sees the text as a series of 8508 XEmacs. The @dfn{buffer} representation sees the text as a series of
8407 bytes (Bufbytes), with a variable number of bytes used per character. 8509 bytes (Ibytes), with a variable number of bytes used per character.
8408 The @dfn{character} representation sees the text as a series of integers 8510 The @dfn{character} representation sees the text as a series of integers
8409 (Emchars), one per character. The character representation is a cleaner 8511 (Ichars), one per character. The character representation is a cleaner
8410 representation from a theoretical standpoint, and is thus used in many 8512 representation from a theoretical standpoint, and is thus used in many
8411 cases when lots of manipulations on a string need to be done. However, 8513 cases when lots of manipulations on a string need to be done. However,
8412 the buffer representation is the standard representation used in both 8514 the buffer representation is the standard representation used in both
8413 Lisp strings and buffers, and because of this, it is the ``default'' 8515 Lisp strings and buffers, and because of this, it is the ``default''
8414 representation that text comes in. The reason for using this 8516 representation that text comes in. The reason for using this
9037 @deftypefunx int Lstream_fgetc (Lstream *@var{stream}) 9139 @deftypefunx int Lstream_fgetc (Lstream *@var{stream})
9038 @deftypefunx void Lstream_fungetc (Lstream *@var{stream}, int @var{c}) 9140 @deftypefunx void Lstream_fungetc (Lstream *@var{stream}, int @var{c})
9039 Function equivalents of the above macros. 9141 Function equivalents of the above macros.
9040 @end deftypefun 9142 @end deftypefun
9041 9143
9042 @deftypefun ssize_t Lstream_read (Lstream *@var{stream}, void *@var{data}, size_t @var{size}) 9144 @deftypefun Bytecount Lstream_read (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size})
9043 Read @var{size} bytes of @var{data} from the stream. Return the number 9145 Read @var{size} bytes of @var{data} from the stream. Return the number
9044 of bytes read. 0 means EOF. -1 means an error occurred and no bytes 9146 of bytes read. 0 means EOF. -1 means an error occurred and no bytes
9045 were read. 9147 were read.
9046 @end deftypefun 9148 @end deftypefun
9047 9149
9048 @deftypefun ssize_t Lstream_write (Lstream *@var{stream}, void *@var{data}, size_t @var{size}) 9150 @deftypefun Bytecount Lstream_write (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size})
9049 Write @var{size} bytes of @var{data} to the stream. Return the number 9151 Write @var{size} bytes of @var{data} to the stream. Return the number
9050 of bytes written. -1 means an error occurred and no bytes were written. 9152 of bytes written. -1 means an error occurred and no bytes were written.
9051 @end deftypefun 9153 @end deftypefun
9052 9154
9053 @deftypefun void Lstream_unread (Lstream *@var{stream}, void *@var{data}, size_t @var{size}) 9155 @deftypefun void Lstream_unread (Lstream *@var{stream}, void *@var{data}, Bytecount @var{size})
9054 Push back @var{size} bytes of @var{data} onto the input queue. The next 9156 Push back @var{size} bytes of @var{data} onto the input queue. The next
9055 call to @code{Lstream_read()} with the same size will read the same 9157 call to @code{Lstream_read()} with the same size will read the same
9056 bytes back. Note that this will be the case even if there is other 9158 bytes back. Note that this will be the case even if there is other
9057 pending unread data. 9159 pending unread data.
9058 @end deftypefun 9160 @end deftypefun
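
The push-back behavior described above can be modelled with a small stack. This is a toy model with hypothetical names and a fixed-size push-back area, not the Lstream implementation; it only illustrates that the most recently unread bytes come back first, even when older unread data is still pending.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PUSHBACK_MAX = 64 };

struct toy_stream
{
  const unsigned char *src;            /* underlying data source */
  size_t src_len, src_pos;
  unsigned char unread[PUSHBACK_MAX];  /* stack; top is the next byte read */
  size_t unread_len;
};

/* Push SIZE bytes back so that the next read returns them in order. */
void
toy_unread (struct toy_stream *s, const unsigned char *data, size_t size)
{
  size_t i;
  assert (s->unread_len + size <= PUSHBACK_MAX);
  /* Store in reverse so that data[0] ends up on top of the stack. */
  for (i = size; i > 0; i--)
    s->unread[s->unread_len++] = data[i - 1];
}

/* Read up to SIZE bytes: pending unread data first, then the source. */
size_t
toy_read (struct toy_stream *s, unsigned char *out, size_t size)
{
  size_t n = 0;
  while (n < size && s->unread_len > 0)
    out[n++] = s->unread[--s->unread_len];
  while (n < size && s->src_pos < s->src_len)
    out[n++] = s->src[s->src_pos++];
  return n;
}

/* Return 1 if unread data comes back as described, 0 otherwise. */
int
demo_unread_roundtrip (void)
{
  struct toy_stream s = { (const unsigned char *) "hello", 5, 0, { 0 }, 0 };
  unsigned char buf[8];

  if (toy_read (&s, buf, 2) != 2 || memcmp (buf, "he", 2) != 0)
    return 0;
  toy_unread (&s, buf, 2);                          /* pending: "he" */
  toy_unread (&s, (const unsigned char *) "XY", 2); /* pushed on top */
  if (toy_read (&s, buf, 2) != 2 || memcmp (buf, "XY", 2) != 0)
    return 0;
  if (toy_read (&s, buf, 2) != 2 || memcmp (buf, "he", 2) != 0)
    return 0;
  if (toy_read (&s, buf, 8) != 3 || memcmp (buf, "llo", 3) != 0)
    return 0;
  return 1;
}
```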
9074 9176
9075 @node Lstream Methods 9177 @node Lstream Methods
9076 @section Lstream Methods 9178 @section Lstream Methods
9077 @cindex lstream methods 9179 @cindex lstream methods
9078 9180
9079 @deftypefn {Lstream Method} ssize_t reader (Lstream *@var{stream}, unsigned char *@var{data}, size_t @var{size}) 9181 @deftypefn {Lstream Method} Bytecount reader (Lstream *@var{stream}, unsigned char *@var{data}, Bytecount @var{size})
9080 Read some data from the stream's end and store it into @var{data}, which 9182 Read some data from the stream's end and store it into @var{data}, which
9081 can hold @var{size} bytes. Return the number of bytes read. A return 9183 can hold @var{size} bytes. Return the number of bytes read. A return
9082 value of 0 means no bytes can be read at this time. This may be because 9184 value of 0 means no bytes can be read at this time. This may be because
9083 of an EOF, or because there is a granularity greater than one byte that 9185 of an EOF, or because there is a granularity greater than one byte that
9084 the stream imposes on the returned data, and @var{size} is less than 9186 the stream imposes on the returned data, and @var{size} is less than
9091 calls @code{Lstream_read()} with a very small size. 9193 calls @code{Lstream_read()} with a very small size.
9092 9194
9093 This function can be @code{NULL} if the stream is output-only. 9195 This function can be @code{NULL} if the stream is output-only.
9094 @end deftypefn 9196 @end deftypefn
9095 9197
9096 @deftypefn {Lstream Method} ssize_t writer (Lstream *@var{stream}, const unsigned char *@var{data}, size_t @var{size}) 9198 @deftypefn {Lstream Method} Bytecount writer (Lstream *@var{stream}, const unsigned char *@var{data}, Bytecount @var{size})
9097 Send some data to the stream's end. Data to be sent is in @var{data} 9199 Send some data to the stream's end. Data to be sent is in @var{data}
9098 and is @var{size} bytes. Return the number of bytes sent. This 9200 and is @var{size} bytes. Return the number of bytes sent. This
9099 function can send and return fewer bytes than is passed in; in that 9201 function can send and return fewer bytes than is passed in; in that
9100 case, the function will just be called again until there is no data left 9202 case, the function will just be called again until there is no data left
9101 or 0 is returned. A return value of 0 means that no more data can be 9203 or 0 is returned. A return value of 0 means that no more data can be
9692 Similarly, a string may or may not have an extent_info structure. 9794 Similarly, a string may or may not have an extent_info structure.
9693 (Generally it won't if there haven't been any extents added to the 9795 (Generally it won't if there haven't been any extents added to the
9694 string.) So use the @code{_force} version if you need the extent_info 9796 string.) So use the @code{_force} version if you need the extent_info
9695 structure to be there. 9797 structure to be there.
9696 9798
9697 A list of extents is maintained as a double gap array: one gap array 9799 A list of extents is maintained as a double gap array: One gap array
9698 is ordered by start index (the @dfn{display order}) and the other is 9800 is ordered by start index (the @dfn{display order}) and the other is
9699 ordered by end index (the @dfn{e-order}). Note that positions in an 9801 ordered by end index (the @dfn{e-order}). Note that positions in an
9700 extent list should logically be conceived of as referring @emph{to} a 9802 extent list should logically be conceived of as referring @emph{to} a
9701 particular extent (as is the norm in programs) rather than sitting 9803 particular extent (as is the norm in programs) rather than sitting
9702 between two extents. Note also that callers of these functions should 9804 between two extents. Note also that callers of these functions should
9703 not be aware of the fact that the extent list is implemented as an 9805 not be aware of the fact that the extent list is implemented as an
9704 array, except for the fact that positions are integers (this should be 9806 array, except for the fact that positions are integers (this should be
9705 generalized to handle integers and linked lists equally well). 9807 generalized to handle integers and linked lists equally well).
9808
9809 A gap array is the same structure used by buffer text: an array of
9810 elements with a ``gap'' somewhere in the middle. Insertion and deletion
9811 happen by moving the gap to the insertion/deletion point, and then
9812 expanding/contracting it as necessary. Gap arrays have a number of
9813 useful properties:
9814
9815 @enumerate
9816 @item
9817 They are space efficient, as there is no need for next/previous pointers.
9818
9819 @item
9820 If the items in them are sorted, locating an item is fast -- @math{O(log N)}.
9821
9822 @item
9823 Insertion and deletion are very fast (constant time, essentially) if the
9824 gap is near (which favors localized operations, as will usually be the
9825 case). Even if not, it requires only a block move of memory, which is
9826 generally a highly optimized operation on modern processors.
9827
9828 @item
9829 Code to manipulate them is relatively simple to write.
9830 @end enumerate
9831
9832 An alternative would be balanced binary trees, which have guaranteed
9833 @math{O(log N)} time for all operations (although the constant factors
9834 are not as good, and repeated localized operations will be slower than
9835 for a gap array). Such code is quite tricky to write, however.
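
The logarithmic lookup claimed above follows because a gap array still supports constant-time access by logical index; the gap only adds an offset. The sketch below shows this with a sorted array of plain integers. The names are hypothetical and indices are 0-based here; the real extent code sorts extents, not integers.

```c
#include <assert.h>

struct int_gap_array
{
  const int *elts;  /* storage, including the unused gap slots */
  int count;        /* number of real elements */
  int gap_start;    /* logical index at which the gap sits */
  int gap_size;     /* number of unused slots in the gap */
};

/* Element at logical index I, skipping over the gap if necessary. */
int
gap_array_at (const struct int_gap_array *ga, int i)
{
  assert (i >= 0 && i < ga->count);
  return ga->elts[i < ga->gap_start ? i : i + ga->gap_size];
}

/* Binary search: return the first logical index whose element is not
   less than KEY, i.e. where KEY would go if inserted. */
int
gap_array_locate (const struct int_gap_array *ga, int key)
{
  int lo = 0, hi = ga->count;
  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      if (gap_array_at (ga, mid) < key)
        lo = mid + 1;
      else
        hi = mid;
    }
  return lo;
}

/* Demo: elements 1 3 5 7 9 11 with a 3-slot gap after the 5. */
int
demo_locate (int key)
{
  static const int storage[] = { 1, 3, 5, 0, 0, 0, 7, 9, 11 };
  struct int_gap_array ga = { storage, 6, 3, 3 };
  return gap_array_locate (&ga, key);
}
```

The search never needs to know where the gap is beyond the index translation in @code{gap_array_at}, which is part of why such code is comparatively simple to write.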
9706 9836
9707 @node Zero-Length Extents 9837 @node Zero-Length Extents
9708 @section Zero-Length Extents 9838 @section Zero-Length Extents
9709 @cindex zero-length extents 9839 @cindex zero-length extents
9710 @cindex extents, zero-length 9840 @cindex extents, zero-length
9829 This is the analog of Theorem 1, and applies because the e-order 9959 This is the analog of Theorem 1, and applies because the e-order
9830 sorts by increasing ending index. 9960 sorts by increasing ending index.
9831 9961
9832 Therefore, @math{F} can be found in the same amount of time as 9962 Therefore, @math{F} can be found in the same amount of time as
9833 operation (1), i.e. the time that it takes to locate where an extent 9963 operation (1), i.e. the time that it takes to locate where an extent
9834 would go if inserted into the e-order list. 9964 would go if inserted into the e-order list. This is @math{O(log N)},
9835 9965 since we are using gap arrays to manage extents.
9836 If the lists were stored as balanced binary trees, then operation (1)
9837 would take logarithmic time, which is usually quite fast. However,
9838 currently they're stored as simple doubly-linked lists, and instead we
9839 do some caching to try to speed things up.
9840 9966
9841 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents 9967 Define a @dfn{stack of extents} (or @dfn{SOE}) as the set of extents
9842 (ordered in the display order) that overlap an index @math{I}, together 9968 (ordered in display order and e-order, just like for normal extent
9843 with the SOE's @dfn{previous} extent, which is an extent that precedes 9969 lists) that overlap an index @math{I}.
9844 @math{I} in the e-order. (Hopefully there will not be very many extents
9845 between @math{I} and the previous extent.)
9846 9970
9847 Now: 9971 Now:
9848 9972
9849 Let @math{I} be an index, let @math{S} be the stack of extents on 9973 Let @math{I} be an index, let @math{S} be the stack of extents on
9850 @math{I}, let @math{F} be the first extent in @math{S}, and let @math{P} 9974 @math{I} and let @math{F} be the first extent in @math{S}.
9851 be @math{S}'s previous extent.
9852 9975
9853 Theorem 3: The first extent in @math{S} is the first extent that overlaps 9976 Theorem 3: The first extent in @math{S} is the first extent that overlaps
9854 any range @math{[I, J]}. 9977 any range @math{[I, J]}.
9855 9978
9856 Proof: Any extent that overlaps @math{[I, J]} but does not include 9979 Proof: Any extent that overlaps @math{[I, J]} but does not include