changeset 1496:e7b471ce22c1
[xemacs-hg @ 2003-05-22 09:57:53 by stephent]
intro to matching <87k7cjs3tb.fsf@tleepslib.sk.tsukuba.ac.jp>
author      stephent
date        Thu, 22 May 2003 09:57:57 +0000
parents     c3cf7db99b98
children    0ad2242c43a4
files       man/ChangeLog man/internals/internals.texi
diffstat    2 files changed, 225 insertions(+), 0 deletions(-)
--- a/man/ChangeLog	Thu May 22 07:41:27 2003 +0000
+++ b/man/ChangeLog	Thu May 22 09:57:57 2003 +0000
@@ -1,3 +1,7 @@
+2003-05-22  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* internals/internals.texi (Searching and Matching): New node.
+
 2003-05-17  Stephen J. Turnbull  <stephen@xemacs.org>

 	* xemacs-faq.texi (detail menu): Reformat "Current Events" caption.
--- a/man/internals/internals.texi	Thu May 22 07:41:27 2003 +0000
+++ b/man/internals/internals.texi	Thu May 22 09:57:57 2003 +0000
@@ -270,6 +270,7 @@
 * Markers and Extents::         Tagging locations within a buffer.
 * Ibytes and Ichars::           Representation of individual characters.
 * The Buffer Object::           The Lisp object corresponding to a buffer.
+* Searching and Matching::      Higher-level algorithms.

 MULE Character Sets and Encodings

@@ -8183,6 +8184,7 @@
 * Markers and Extents::         Tagging locations within a buffer.
 * Ibytes and Ichars::           Representation of individual characters.
 * The Buffer Object::           The Lisp object corresponding to a buffer.
+* Searching and Matching::      Higher-level algorithms.
 @end menu

 @node Introduction to Buffers

@@ -8566,6 +8568,225 @@

@node Searching and Matching
@section Searching and Matching
@cindex searching
@cindex matching

This section is very incomplete; for now it is limited to a brief
introduction.

People find the searching and matching code difficult to understand,
and indeed the details are hard.  However, the basic structures are not
so complex.  First, there is a hard question with a simple answer: what
about Mule?  It turns out that Mule characters can be matched byte by
byte, so neither the search code nor the regular expression code needs
to take much notice of Mule at all!  Of course, we add some special
features (such as regular expressions that match only certain
charsets), but these do not require new concepts.  The main exception
is that wild-card matches in Mule have to be careful to swallow whole
characters; this is handled using the same basic macros that are used
for buffer and string movements.

Perhaps surprisingly, the complex search algorithms are the ones used
for simple string searches.  In particular, fast string searching uses
the Boyer-Moore algorithm, which is based on the idea that when a
mismatch occurs at a given position, you can precompute how far ahead
to restart the search.  This means you can often make far fewer than N
character comparisons, where N is the position at which the match is
found, or the size of the text if it contains no match.  That's fast!
But it's not easy: you must ``compile'' the search string into a jump
table.  See the source, @file{search.c}, for more information.
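To make the jump-table idea concrete, here is a minimal sketch along
the lines of the Boyer-Moore-Horspool simplification (the
``bad-character'' table only).  This is @emph{not} the code in
@file{search.c}: it ignores case translation, Mule character
boundaries, and backward searches, and the function names are invented
for this illustration.

@example
#define ALPHABET_SIZE 256

/* For each byte value, how far may the pattern be shifted when that
   byte is the rightmost one examined in the current window?  */
static void
compute_skip_table (const unsigned char *pat, int patlen,
                    int skip[ALPHABET_SIZE])
@{
  int i;
  for (i = 0; i < ALPHABET_SIZE; i++)
    skip[i] = patlen;               /* byte absent: skip a whole pattern */
  for (i = 0; i < patlen - 1; i++)
    skip[pat[i]] = patlen - 1 - i;  /* distance from rightmost occurrence */
@}

/* Return the offset of the first occurrence of PAT in BUF, or -1.  */
static int
simple_boyer_moore (const unsigned char *buf, int buflen,
                    const unsigned char *pat, int patlen)
@{
  int skip[ALPHABET_SIZE];
  int pos = 0;

  if (patlen == 0)
    return 0;
  compute_skip_table (pat, patlen, skip);
  while (pos + patlen <= buflen)
    @{
      int i = patlen - 1;
      while (i >= 0 && buf[pos + i] == pat[i])
        i--;                        /* compare right to left */
      if (i < 0)
        return pos;                 /* every byte matched */
      pos += skip[buf[pos + patlen - 1]];
    @}
  return -1;
@}
@end example

Note that the comparison loop runs right to left; a mismatch on the
very first comparison can shift the window by the whole pattern length,
which is where the sub-linear behavior comes from.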
Emacs changes the basic algorithms somewhat in order to handle
case-insensitive searches without a full-blown regular expression.

Regular expressions, on the other hand, have a trivial search
implementation: try a match at each position.  (Under POSIX rules it is
a bit more complex, because POSIX requires that you find the
@emph{longest} match in the text.  This means you keep a record of the
best match so far, and find all the matches.)

The matching code for regular expressions is quite complex.  First, the
regular expression itself is compiled.  There are two basic approaches
that could be taken.  The first is to compile the expression into
tables that drive a generic finite automaton emulator.  This is the
approach given in many textbooks (Sedgewick's @emph{Algorithms} and
Aho, Sethi, and Ullman's @emph{Compilers: Principles, Techniques, and
Tools}, aka ``The Dragon Book''), and it is also the approach used by
the @file{lex} family of lexical analysis engines.

Emacs uses a somewhat different technique.  The expression is compiled
into a form of bytecode, which is interpreted by a special interpreter.
The interpreter itself basically amounts to an inline implementation of
the finite automaton emulator.  The advantage of this technique is that
it is easier to add special features, such as control of
case-sensitivity via a global variable.

The compiler is not treated here; see the source, @file{regex.c}.  The
interpreter, although it is divided into several functions and looks
fearsomely complex, is actually quite simple in concept.  After all,
what it is doing is basically a @code{strcmp} on steroids, right?

@example
int
strcmp (char *p,   /* pattern pointer */
        char *b)   /* buffer pointer */
@{
  while (*p != '\0' && *p == *b)
    p++, b++;
  return *p - *b;  /* difference at the mismatch (or at the end) */
@}
@end example

Really, it's no harder than that.  (A bit of a white lie, OK?)

How does the regexp code generalize this?

@enumerate
@item
Depending on the pattern, @code{*b} may have a more general
relationship to @code{*p}.  @emph{I.e.}, direct comparison against
@code{*p} is generalized to include checks for set membership and for
context-dependent properties.  The latter depend on @code{&*b}; of
course that is meaningless in C, so we use @code{b} directly, instead.

@item
Although @code{b} must advance step by step (to ensure that the
algorithm terminates), @code{p} can branch and jump.

@item
The information returned is much greater, including information about
subexpressions.
@end enumerate

We'll ignore (3).  (2) is mostly interesting when compiling the regular
expression.  Now we have

@example
@group
enum operator_t @{
  accept = 0,
  exact,
  any,
  range,
  group,    /* actually, these are probably */
  repeat,   /* turned into conditional code */
  /* etc */
@};
@end group

@group
enum status_t @{
  working = 0,
  matched,
  mismatch,
  end_of_buffer,
  error
@};
@end group

@group
struct pattern @{
  enum operator_t operator;
  char char_value;
  boolean range_table[256];
  /* etc, etc */
@};
@end group

@group
enum status_t
match (struct pattern *p,   /* pattern pointer */
       char *b)             /* buffer pointer */
@{
  enum status_t done = working;

  while (!(done = match_1_operator (p, b)))
    @{
      struct pattern *p1 = p;
      p = next_p (p, b);
      b = next_b (p1, b);
    @}
  return done;
@}
@end group
@end example

This format exposes the underlying finite automaton.  The three helper
functions all have the same structure as @code{match_1_operator},
below, except that the @samp{next_*} functions decide where to jump
(for @samp{p}) and whether or not to increment (for @samp{b}), rather
than checking for satisfaction of a matching condition.

@example
enum status_t
match_1_operator (struct pattern *p, char *b)
@{
  if (! *b) return end_of_buffer;
  switch (p->operator)
    @{
    case accept:
      return matched;
    case exact:
      if (*b != p->char_value) return mismatch; else break;
    case any:
      break;
    case range:
      /* range_table is computed in the regexp_compile function */
      if (! p->range_table[*b]) return mismatch;
      /* etc, etc */
    @}
  return working;
@}
@end example
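The @samp{next_*} functions are referred to above but not defined; in
the real interpreter they are not separate functions at all (they are
inlined into the main loop, as described below).  For the toy
representation above they might look something like the following
sketch; the choice of which operators consume a byte, and the
@code{default} behavior, are invented here for illustration.

@example
/* Where does the pattern pointer go next?  In this toy representation
   every operator is simply followed by the next one, so we just step
   forward; grouping, repetition, and alternation would compile into
   explicit jumps, and this is where such a jump would be taken.  */
struct pattern *
next_p (struct pattern *p, char *b)
@{
  return p + 1;
@}

/* Does the buffer pointer advance?  Operators that consume a byte
   step it forward; an operator that only tests context would not.  */
char *
next_b (struct pattern *p, char *b)
@{
  switch (p->operator)
    @{
    case exact:
    case any:
    case range:
      return b + 1;     /* these operators each consume one byte */
    default:
      return b;         /* a pure test consumes nothing */
    @}
@}
@end example

Keeping the ``which operator matched'' logic separate from the ``where
do we go next'' logic is exactly what makes the finite automaton
visible; the optimizations listed below obscure it again for speed.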
Grouping, repetition, and alternation are handled by compiling the
subexpression and calling @code{match (p->subpattern, b)} recursively.

In terms of reading the actual code, there are five optimizations
(obfuscations, if you like) that have been made.

@enumerate
@item
An explicit ``failure stack'' has been substituted for recursion.

@item
The @code{match_1_operator}, @code{next_p}, and @code{next_b} functions
are actually inlined into the @code{match} function for efficiency, so
the pointer movement is interspersed with the matching operations.

@item
If the operator uses buffer context, the buffer pointer movement is
sometimes implicit in the operations retrieving the context.

@item
Some cases are combined: a short piece of preparation specific to one
case ``falls through'' into code shared by several cases.

@item
The @code{pattern} type is not an explicit @samp{struct}.  Instead, the
data (including, @emph{e.g.}, @samp{range_table}) is inlined into the
compiled bytecode.  This leads to bizarre-looking code in the
interpreter, like

@example
case range:
  p += *(p + 1); break;
@end example

in @code{next_p}, because the compiled pattern is laid out

@example
..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ...
@end example
@end enumerate

But if you keep your eye on the ``switch in a loop'' structure, you
should be able to understand the parts you need.


@node MULE Character Sets and Encodings, The Lisp Reader and Compiler, Buffers and Textual Representation, Top
@chapter MULE Character Sets and Encodings
@cindex Mule character sets and encodings