Mercurial > hg > xemacs-beta
comparison man/internals/internals.texi @ 1496:e7b471ce22c1
[xemacs-hg @ 2003-05-22 09:57:53 by stephent]
intro to matching <87k7cjs3tb.fsf@tleepslib.sk.tsukuba.ac.jp>
author | stephent |
---|---|
date | Thu, 22 May 2003 09:57:57 +0000 |
parents | 1b0339b048ce |
children | 2f1ad95e2378 |
comparison
equal
deleted
inserted
replaced
1495:c3cf7db99b98 | 1496:e7b471ce22c1 |
---|---|
268 * The Text in a Buffer:: Representation of the text in a buffer. | 268 * The Text in a Buffer:: Representation of the text in a buffer. |
269 * Buffer Lists:: Keeping track of all buffers. | 269 * Buffer Lists:: Keeping track of all buffers. |
270 * Markers and Extents:: Tagging locations within a buffer. | 270 * Markers and Extents:: Tagging locations within a buffer. |
271 * Ibytes and Ichars:: Representation of individual characters. | 271 * Ibytes and Ichars:: Representation of individual characters. |
272 * The Buffer Object:: The Lisp object corresponding to a buffer. | 272 * The Buffer Object:: The Lisp object corresponding to a buffer. |
273 * Searching and Matching:: Higher-level algorithms. | |
273 | 274 |
274 MULE Character Sets and Encodings | 275 MULE Character Sets and Encodings |
275 | 276 |
276 * Character Sets:: | 277 * Character Sets:: |
277 * Encodings:: | 278 * Encodings:: |
8181 * The Text in a Buffer:: Representation of the text in a buffer. | 8182 * The Text in a Buffer:: Representation of the text in a buffer. |
8182 * Buffer Lists:: Keeping track of all buffers. | 8183 * Buffer Lists:: Keeping track of all buffers. |
8183 * Markers and Extents:: Tagging locations within a buffer. | 8184 * Markers and Extents:: Tagging locations within a buffer. |
8184 * Ibytes and Ichars:: Representation of individual characters. | 8185 * Ibytes and Ichars:: Representation of individual characters. |
8185 * The Buffer Object:: The Lisp object corresponding to a buffer. | 8186 * The Buffer Object:: The Lisp object corresponding to a buffer. |
8187 * Searching and Matching:: Higher-level algorithms. | |
8186 @end menu | 8188 @end menu |
8187 | 8189 |
8188 @node Introduction to Buffers | 8190 @node Introduction to Buffers |
8189 @section Introduction to Buffers | 8191 @section Introduction to Buffers |
8190 @cindex buffers, introduction to | 8192 @cindex buffers, introduction to |
8563 | 8565 |
8564 @item base_buffer | 8566 @item base_buffer |
8565 This field holds the buffer's base buffer (if it is an indirect buffer), | 8567 This field holds the buffer's base buffer (if it is an indirect buffer), |
8566 or @code{nil}. | 8568 or @code{nil}. |
8567 @end table | 8569 @end table |
8570 | |
8571 @node Searching and Matching | |
8572 @section Searching and Matching | |
8573 @cindex searching | |
8574 @cindex matching | |
8575 | |
8576 Very incomplete, limited to a brief introduction. | |
8577 | |
8578 People find the searching and matching code difficult to understand. | |
8579 And indeed, the details are hard. However, the basic structures are not | |
8580 so complex. First, there's a hard question with a simple answer. What | |
8581 about Mule? The answer here is that it turns out that Mule characters | |
8582 can be matched byte by byte, so neither the search code nor the regular | |
8583 expression code need take much notice of it at all! Of course, we add | |
8584 some special features (such as regular expressions that match only | |
8585 certain charsets), but these do not require new concepts. The main | |
8586 exception is that wild-card matches in Mule have to be careful to | |
8587 swallow whole characters. This is handled using the same basic macros | |
8588 that are used for buffer and string movements. | |
8589 | |
8590 The complex algorithms for searching are for simple string searches. In | |
8591 particular, the algorithm used for fast string searching is Boyer-Moore. | |
8592 This algorithm is based on the idea that if you have a mismatch at a | |
8593 given position, you can precompute where to restart the search. This | |
8594 typically means that you can often make many fewer than N character | |
8595 comparisons, where N is the position at which the match is found, or the | |
8596 size of the text if it contains no match. That's fast! But it's not | |
8597 easy. You must ``compile'' the search string into a jump table. See | |
8598 the source, @file{search.c}, for more information. | |
8599 | |
8600 Emacs changes the basic algorithms somewhat in order to handle | |
8601 case-insensitive searches without a full-blown regular expression. | |
8602 | |
8603 Regular expressions, on the other hand, have a trivial search | |
8604 implementation: try a match at each position. (Under POSIX rules, it's | |
8605 a bit more complex, because POSIX requires that you find the | |
8606 @emph{longest} match in the text. This means you keep a record of the | |
8607 best match so far, and find all the matches.) | |
8608 | |
8609 The matching code for regular expressions is quite complex. First, the | |
8610 regular expression itself is compiled. There are two basic approaches | |
8611 that could be taken. The first is to compile the expression into tables | |
8612 to drive a generic finite automaton emulator. This is the approach | |
8613 given in many textbooks (Sedgewick's @emph{Algorithms} and Aho, Sethi, | |
8614 and Ullmann's @emph{Compilers: Principles, Techniques, and Tools}, aka | |
8615 ``The Dragon Book'') as well as being used by the @file{lex} family of | |
8616 lexical analysis engines. | |
8617 | |
8618 Emacs uses a somewhat different technique. The expression is compiled | |
8619 into a form of bytecode, which is interpreted by a special interpreter. | |
8620 The interpreter itself basically amounts to an inline implementation of | |
8621 the finite automaton emulator. The advantage of this technique is that | |
8622 it's easier to add special features, such as control of case-sensitivity | |
8623 via a global variable. | |
8624 | |
8625 The compiler is not treated here. See the source, @file{regex.c}. The | |
8626 interpreter, although it is divided into several functions, and looks | |
8627 fearsomely complex, is actually quite simple in concept. However, | |
8628 basically what you're doing there is a strcmp on steroids, right? | |
8629 | |
8630 @example | |
8631 int | |
8632 strcmp (char *p, /* pattern pointer */ | |
8633 char *b) /* buffer pointer */ | |
8634 @{ | |
8635 while (*p++ == *b++) | |
8636 ; | |
8637 return *(--p) - *(--b); /* oops, we overshot */ | |
8638 @} | |
8639 @end example | |
8640 | |
8641 Really, it's no harder than that. (A bit of a white lie, OK?) | |
8642 | |
8643 How does the regexp code generalize this? | |
8644 | |
8645 @enumerate | |
8646 @item | |
8647 Depending on the pattern, @code{*b} may have a general relationship to | |
8648 @code{*p}. @emph{I.e.}, direct comparison against @code{*p} is | |
8649 generalized to include checks for set membership, and context dependent | |
8650 properties. This depends on @code{&*b}. Of course that's meaningless | |
8651 in C, so we use @code{b} directly, instead. | |
8652 | |
8653 @item | |
8654 Although to ensure the algorithm terminates, @code{b} must advance step | |
8655 by step, @code{p} can branch and jump. | |
8656 | |
8657 @item | |
8658 The information returned is much greater, including information about | |
8659 subexpressions. | |
8660 @end enumerate | |
8661 | |
8662 We'll ignore (3). (2) is mostly interesting when compiling the regular | |
8663 expression. Now we have | |
8664 | |
8665 @example | |
8666 @group | |
8667 enum operator_t @{ | |
8668 accept = 0, | |
8669 exact, | |
8670 any, | |
8671 range, | |
8672 group, /* actually, these are probably */ | |
8673 repeat, /* turned into conditional code */ | |
8674 /* etc */ | |
8675 @}; | |
8676 @end group | |
8677 | |
8678 @group | |
8679 enum status_t @{ | |
8680 working = 0, | |
8681 matched, | |
8682 mismatch, | |
8683 end_of_buffer, | |
8684 error | |
8685 @}; | |
8686 @end group | |
8687 | |
8688 @group | |
8689 struct pattern @{ | |
8690 enum operator_t operator; | |
8691 char char_value; | |
8692 boolean range_table[256]; | |
8693 /* etc, etc */ | |
8694 @}; | |
8695 @end group | |
8696 | |
8697 @group | |
8698 char *p, /* pattern pointer */ | |
8699 *b; /* buffer pointer */ | |
8700 | |
8701 enum status_t | |
8702 match (struct pattern *p, char *b) | |
8703 @{ | |
8704 enum status_t done = working; | |
8705 | |
8706 while (!(done = match_1_operator (p, b))) | |
8707 @{ | |
8708 struct pattern *p1 = p; | |
8709 p = next_p (p, b); | |
8710 b = next_b (p1, b); | |
8711 @} | |
8712 return done; | |
8713 @} | |
8714 @end group | |
8715 @end example | |
8716 | |
8717 This format exposes the underlying finite automaton. | |
8718 | |
8719 All of them have the following structure, except that the @samp{next_*} | |
8720 functions decide where to jump (for @samp{p}) and whether or not to | |
8721 increment (for @samp{b}), rather than checking for satisfaction of a | |
8722 matching condition. | |
8723 | |
8724 @example | |
8725 enum status_t | |
8726 match_1_operator (pattern *p, char *b) | |
8727 @{ | |
8728 if (! *b) return end_of_buffer; | |
8729 switch (p->operator) | |
8730 @{ | |
8731 case accept: | |
8732 return matched; | |
8733 case exact: | |
8734 if (*b != p->char_value) return mismatch; else break; | |
8735 case any: | |
8736 break; | |
8737 case range: | |
8738 /* range_table is computed in the regexp_compile function */ | |
8739 if (! p->range_table[*b]) return mismatch; | |
8740 /* etc, etc */ | |
8741 @} | |
8742 return working; | |
8743 @} | |
8744 @end example | |
8745 | |
8746 Grouping, repetition, and alternation are handled by compiling the | |
8747 subexpression and calling @code{match (p->subpattern, b)} recursively. | |
8748 | |
8749 In terms of reading the actual code, there are five optimizations | |
8750 (obfuscations, if you like) that have been done. | |
8751 | |
8752 @enumerate | |
8753 @item | |
8754 An explicit "failure stack" has been substituted for recursion. | |
8755 | |
8756 @item | |
8757 The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions | |
8758 are actually inlined into the @code{match} function for efficiency. | |
8759 Then the pointer movement is interspersed with the matching operations. | |
8760 | |
8761 @item | |
8762 If the operator uses buffer context, the buffer pointer movement is | |
8763 sometimes implicit in the operations retrieving the context. | |
8764 | |
8765 @item | |
8766 Some cases are combined into short preparation for individual cases, and | |
8767 a "fall-through" into combined code for several cases. | |
8768 | |
8769 @item | |
8770 The @code{pattern} type is not an explicit @samp{struct}. Instead, the | |
8771 data (including, @emph{e.g.}, @samp{range_table}) is inlined into the | |
8772 compiled bytecode. This leads to bizarre code in the interpreter like | |
8773 | |
8774 @example | |
8775 case range: | |
8776 p += *(p + 1); break; | |
8777 @end example | |
8778 | |
8779 in @code{next_p}, because the compiled pattern is laid out | |
8780 | |
8781 @example | |
8782 ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ... | |
8783 @end example | |
8784 @end enumerate | |
8785 | |
8786 But if you keep your eye on the "switch in a loop" structure, you | |
8787 should be able to understand the parts you need. | |
8788 | |
8568 | 8789 |
8569 @node MULE Character Sets and Encodings, The Lisp Reader and Compiler, Buffers and Textual Representation, Top | 8790 @node MULE Character Sets and Encodings, The Lisp Reader and Compiler, Buffers and Textual Representation, Top |
8570 @chapter MULE Character Sets and Encodings | 8791 @chapter MULE Character Sets and Encodings |
8571 @cindex Mule character sets and encodings | 8792 @cindex Mule character sets and encodings |
8572 @cindex character sets and encodings, Mule | 8793 @cindex character sets and encodings, Mule |