comparison man/internals/internals.texi @ 1496:e7b471ce22c1

[xemacs-hg @ 2003-05-22 09:57:53 by stephent] intro to matching <87k7cjs3tb.fsf@tleepslib.sk.tsukuba.ac.jp>
author stephent
date Thu, 22 May 2003 09:57:57 +0000
parents 1b0339b048ce
children 2f1ad95e2378
comparison
equal deleted inserted replaced
1495:c3cf7db99b98 1496:e7b471ce22c1
268 * The Text in a Buffer:: Representation of the text in a buffer. 268 * The Text in a Buffer:: Representation of the text in a buffer.
269 * Buffer Lists:: Keeping track of all buffers. 269 * Buffer Lists:: Keeping track of all buffers.
270 * Markers and Extents:: Tagging locations within a buffer. 270 * Markers and Extents:: Tagging locations within a buffer.
271 * Ibytes and Ichars:: Representation of individual characters. 271 * Ibytes and Ichars:: Representation of individual characters.
272 * The Buffer Object:: The Lisp object corresponding to a buffer. 272 * The Buffer Object:: The Lisp object corresponding to a buffer.
273 * Searching and Matching:: Higher-level algorithms.
273 274
274 MULE Character Sets and Encodings 275 MULE Character Sets and Encodings
275 276
276 * Character Sets:: 277 * Character Sets::
277 * Encodings:: 278 * Encodings::
8181 * The Text in a Buffer:: Representation of the text in a buffer. 8182 * The Text in a Buffer:: Representation of the text in a buffer.
8182 * Buffer Lists:: Keeping track of all buffers. 8183 * Buffer Lists:: Keeping track of all buffers.
8183 * Markers and Extents:: Tagging locations within a buffer. 8184 * Markers and Extents:: Tagging locations within a buffer.
8184 * Ibytes and Ichars:: Representation of individual characters. 8185 * Ibytes and Ichars:: Representation of individual characters.
8185 * The Buffer Object:: The Lisp object corresponding to a buffer. 8186 * The Buffer Object:: The Lisp object corresponding to a buffer.
8187 * Searching and Matching:: Higher-level algorithms.
8186 @end menu 8188 @end menu
8187 8189
8188 @node Introduction to Buffers 8190 @node Introduction to Buffers
8189 @section Introduction to Buffers 8191 @section Introduction to Buffers
8190 @cindex buffers, introduction to 8192 @cindex buffers, introduction to
8563 8565
8564 @item base_buffer 8566 @item base_buffer
8565 This field holds the buffer's base buffer (if it is an indirect buffer), 8567 This field holds the buffer's base buffer (if it is an indirect buffer),
8566 or @code{nil}. 8568 or @code{nil}.
8567 @end table 8569 @end table
8570
8571 @node Searching and Matching
8572 @section Searching and Matching
8573 @cindex searching
8574 @cindex matching
8575
8576 Very incomplete, limited to a brief introduction.
8577
8578 People find the searching and matching code difficult to understand.
8579 And indeed, the details are hard. However, the basic structures are not
8580 so complex. First, there's a hard question with a simple answer. What
8581 about Mule? The answer here is that it turns out that Mule characters
8582 can be matched byte by byte, so neither the search code nor the regular
8583 expression code need take much notice of it at all! Of course, we add
8584 some special features (such as regular expressions that match only
8585 certain charsets), but these do not require new concepts. The main
8586 exception is that wild-card matches in Mule have to be careful to
8587 swallow whole characters. This is handled using the same basic macros
8588 that are used for buffer and string movements.
8589
8590 The complex algorithms for searching are for simple string searches. In
8591 particular, the algorithm used for fast string searching is Boyer-Moore.
8592 This algorithm is based on the idea that if you have a mismatch at a
8593 given position, you can precompute where to restart the search. This
8594 typically means that you can often make many fewer than N character
8595 comparisons, where N is the position at which the match is found, or the
8596 size of the text if it contains no match. That's fast! But it's not
8597 easy. You must ``compile'' the search string into a jump table. See
8598 the source, @file{search.c}, for more information.
8599
8600 Emacs changes the basic algorithms somewhat in order to handle
8601 case-insensitive searches without a full-blown regular expression.
8602
8603 Regular expressions, on the other hand, have a trivial search
8604 implementation: try a match at each position. (Under POSIX rules, it's
8605 a bit more complex, because POSIX requires that you find the
8606 @emph{longest} match in the text. This means you keep a record of the
8607 best match so far, and find all the matches.)
8608
8609 The matching code for regular expressions is quite complex. First, the
8610 regular expression itself is compiled. There are two basic approaches
8611 that could be taken. The first is to compile the expression into tables
8612 to drive a generic finite automaton emulator. This is the approach
8613 given in many textbooks (Sedgewick's @emph{Algorithms} and Aho, Sethi,
8614 and Ullmann's @emph{Compilers: Principles, Techniques, and Tools}, aka
8615 ``The Dragon Book'') as well as being used by the @file{lex} family of
8616 lexical analysis engines.
8617
8618 Emacs uses a somewhat different technique. The expression is compiled
8619 into a form of bytecode, which is interpreted by a special interpreter.
8620 The interpreter itself basically amounts to an inline implementation of
8621 the finite automaton emulator. The advantage of this technique is that
8622 it's easier to add special features, such as control of case-sensitivity
8623 via a global variable.
8624
8625 The compiler is not treated here. See the source, @file{regex.c}. The
8626 interpreter, although it is divided into several functions, and looks
8627 fearsomely complex, is actually quite simple in concept. However,
8628 basically what you're doing there is a strcmp on steroids, right?
8629
8630 @example
8631 int
8632 strcmp (char *p, /* pattern pointer */
8633 char *b) /* buffer pointer */
8634 @{
8635 while (*p++ == *b++)
8636 ;
8637 return *(--p) - *(--b); /* oops, we overshot */
8638 @}
8639 @end example
8640
8641 Really, it's no harder than that. (A bit of a white lie, OK?)
8642
8643 How does the regexp code generalize this?
8644
8645 @enumerate
8646 @item
8647 Depending on the pattern, @code{*b} may have a general relationship to
8648 @code{*p}. @emph{I.e.}, direct comparison against @code{*p} is
8649 generalized to include checks for set membership, and context dependent
8650 properties. This depends on @code{&*b}. Of course that's meaningless
8651 in C, so we use @code{b} directly, instead.
8652
8653 @item
8654 Although to ensure the algorithm terminates, @code{b} must advance step
8655 by step, @code{p} can branch and jump.
8656
8657 @item
8658 The information returned is much greater, including information about
8659 subexpressions.
8660 @end enumerate
8661
8662 We'll ignore (3). (2) is mostly interesting when compiling the regular
8663 expression. Now we have
8664
8665 @example
8666 @group
8667 enum operator_t @{
8668 accept = 0,
8669 exact,
8670 any,
8671 range,
8672 group, /* actually, these are probably */
8673 repeat, /* turned into conditional code */
8674 /* etc */
8675 @};
8676 @end group
8677
8678 @group
8679 enum status_t @{
8680 working = 0,
8681 matched,
8682 mismatch,
8683 end_of_buffer,
8684 error
8685 @};
8686 @end group
8687
8688 @group
8689 struct pattern @{
8690 enum operator_t operator;
8691 char char_value;
8692 boolean range_table[256];
8693 /* etc, etc */
8694 @};
8695 @end group
8696
8697 @group
8698 char *p, /* pattern pointer */
8699 *b; /* buffer pointer */
8700
8701 enum status_t
8702 match (struct pattern *p, char *b)
8703 @{
8704 enum status_t done = working;
8705
8706 while (!(done = match_1_operator (p, b)))
8707 @{
8708 struct pattern *p1 = p;
8709 p = next_p (p, b);
8710 b = next_b (p1, b);
8711 @}
8712 return done;
8713 @}
8714 @end group
8715 @end example
8716
8717 This format exposes the underlying finite automaton.
8718
8719 All of them have the following structure, except that the @samp{next_*}
8720 functions decide where to jump (for @samp{p}) and whether or not to
8721 increment (for @samp{b}), rather than checking for satisfaction of a
8722 matching condition.
8723
8724 @example
8725 enum status_t
8726 match_1_operator (pattern *p, char *b)
8727 @{
8728 if (! *b) return end_of_buffer;
8729 switch (p->operator)
8730 @{
8731 case accept:
8732 return matched;
8733 case exact:
8734 if (*b != p->char_value) return mismatch; else break;
8735 case any:
8736 break;
8737 case range:
8738 /* range_table is computed in the regexp_compile function */
8739 if (! p->range_table[*b]) return mismatch;
8740 /* etc, etc */
8741 @}
8742 return working;
8743 @}
8744 @end example
8745
8746 Grouping, repetition, and alternation are handled by compiling the
8747 subexpression and calling @code{match (p->subpattern, b)} recursively.
8748
8749 In terms of reading the actual code, there are five optimizations
8750 (obfuscations, if you like) that have been done.
8751
8752 @enumerate
8753 @item
8754 An explicit "failure stack" has been substituted for recursion.
8755
8756 @item
8757 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions
8758 are actually inlined into the @code{match} function for efficiency.
8759 Then the pointer movement is interspersed with the matching operations.
8760
8761 @item
8762 If the operator uses buffer context, the buffer pointer movement is
8763 sometimes implicit in the operations retrieving the context.
8764
8765 @item
8766 Some cases are combined into short preparation for individual cases, and
8767 a "fall-through" into combined code for several cases.
8768
8769 @item
8770 The @code{pattern} type is not an explicit @samp{struct}. Instead, the
8771 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the
8772 compiled bytecode. This leads to bizarre code in the interpreter like
8773
8774 @example
8775 case range:
8776 p += *(p + 1); break;
8777 @end example
8778
8779 in @code{next_p}, because the compiled pattern is laid out
8780
8781 @example
8782 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ...
8783 @end example
8784 @end enumerate
8785
8786 But if you keep your eye on the "switch in a loop" structure, you
8787 should be able to understand the parts you need.
8788
8568 8789
8569 @node MULE Character Sets and Encodings, The Lisp Reader and Compiler, Buffers and Textual Representation, Top 8790 @node MULE Character Sets and Encodings, The Lisp Reader and Compiler, Buffers and Textual Representation, Top
8570 @chapter MULE Character Sets and Encodings 8791 @chapter MULE Character Sets and Encodings
8571 @cindex Mule character sets and encodings 8792 @cindex Mule character sets and encodings
8572 @cindex character sets and encodings, Mule 8793 @cindex character sets and encodings, Mule