diff man/lispref/searching.texi @ 255:084402c475ba r20-5b26
Import from CVS: tag r20-5b26
author | cvs |
---|---|
date | Mon, 13 Aug 2007 10:21:18 +0200 |
parents | 376386a54a3c |
children | 7df0dd720c89 |
--- a/man/lispref/searching.texi Mon Aug 13 10:20:29 2007 +0200
+++ b/man/lispref/searching.texi Mon Aug 13 10:21:18 2007 +0200
@@ -161,6 +161,21 @@
 a regexp is a very powerful operation. This section explains how to write
 regexps; the following section says how to search for them.
 
+ To gain a thorough understanding of regular expressions and how to use
+them to best advantage, we recommend that you study @cite{Mastering
+Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly and Associates,
+1997}. (It's known as the "Hip Owls" book, because of the picture on its
+cover.) You might also read the manuals to @ref{(gawk)Top},
+@ref{(ed)Top}, @cite{sed}, @cite{grep}, @ref{(perl)Top},
+@ref{(regex)Top}, @ref{(rx)Top}, @cite{pcre}, and @ref{(flex)Top}, which
+also make good use of regular expressions.
+
+ The XEmacs regular expression syntax most closely resembles that of
+@cite{ed}, or @cite{grep}, the GNU versions of which all utilize the GNU
+@cite{regex} library. XEmacs' version of @cite{regex} has recently been
+extended with some perl--like capabilities, described in the next
+section.
+
 @menu
 * Syntax of Regexps:: Rules for writing regular expressions.
 * Regexp Example:: Illustrates regular expression syntax.
@@ -205,26 +220,28 @@
 
 @item *
 @cindex @samp{*} in regexp
-is not a construct by itself; it is a suffix operator that means to
-repeat the preceding regular expression as many times as possible. In
-@samp{fo*}, the @samp{*} applies to the @samp{o}, so @samp{fo*} matches
-one @samp{f} followed by any number of @samp{o}s. The case of zero
-@samp{o}s is allowed: @samp{fo*} does match @samp{f}.@refill
+is not a construct by itself; it is a quantifying suffix operator that
+means to repeat the preceding regular expression as many times as
+possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so
+@samp{fo*} matches one @samp{f} followed by any number of @samp{o}s.
+The case of zero @samp{o}s is allowed: @samp{fo*} does match
+@samp{f}.@refill
 
 @samp{*} always applies to the @emph{smallest} possible preceding
 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a repeating
 @samp{fo}.@refill
 
-The matcher processes a @samp{*} construct by matching, immediately,
-as many repetitions as can be found. Then it continues with the rest
-of the pattern. If that fails, backtracking occurs, discarding some
-of the matches of the @samp{*}-modified construct in case that makes
-it possible to match the rest of the pattern. For example, in matching
-@samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first
-tries to match all three @samp{a}s; but the rest of the pattern is
-@samp{ar} and there is only @samp{r} left to match, so this try fails.
-The next alternative is for @samp{a*} to match only two @samp{a}s.
-With this choice, the rest of the regexp matches successfully.@refill
+The matcher processes a @samp{*} construct by matching, immediately, as
+many repetitions as can be found; it is "greedy". Then it continues
+with the rest of the pattern. If that fails, backtracking occurs,
+discarding some of the matches of the @samp{*}-modified construct in
+case that makes it possible to match the rest of the pattern. For
+example, in matching @samp{ca*ar} against the string @samp{caaar}, the
+@samp{a*} first tries to match all three @samp{a}s; but the rest of the
+pattern is @samp{ar} and there is only @samp{r} left to match, so this
+try fails. The next alternative is for @samp{a*} to match only two
+@samp{a}s. With this choice, the rest of the regexp matches
+successfully.@refill
 
 Nested repetition operators can be extremely slow if they specify
 backtracking loops. For example, it could take hours for the regular
@@ -236,18 +253,71 @@
 
 @item +
 @cindex @samp{+} in regexp
-is a suffix operator similar to @samp{*} except that the preceding
-expression must match at least once. So, for example, @samp{ca+r}
-matches the strings @samp{car} and @samp{caaaar} but not the string
-@samp{cr}, whereas @samp{ca*r} matches all three strings.
+is a quantifying suffix operator similar to @samp{*} except that the
+preceding expression must match at least once. It is also "greedy".
+So, for example, @samp{ca+r} matches the strings @samp{car} and
+@samp{caaaar} but not the string @samp{cr}, whereas @samp{ca*r} matches
+all three strings.
 
 @item ?
 @cindex @samp{?} in regexp
-is a suffix operator similar to @samp{*} except that the preceding
-expression can match either once or not at all. For example,
+is a quantifying suffix operator similar to @samp{*}, except that the
+preceding expression can match either once or not at all. For example,
 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anyhing
 else.
 
+@item *?
+@cindex @samp{*?} in regexp
+works just like @samp{*}, except that rather than matching the longest
+match, it matches the shortest match. This is known as a "non-greedy"
+quantifier. It is a syntax that comes to us from perl. It is very
+useful for situations where you want to match the text inside a pair of
+delimiters.
+@c Did perl get this from somewhere? What's the real history of *? ?
+
+@lisp
+@group
+(setq s "/ blah / / blah2 /")
+ @result{} "/ blah / / blah2 /"
+(string-match "/.*/" s)
+ @result{} 0
+(match-string 0 s)
+ @result{} "/ blah / / blah2 /"
+(string-match "/.*?/" s)
+ @result{} 0
+(match-string 0 s)
+ @result{} "/ blah /"
+@end group
+@end lisp
+
+@item +?
+@cindex @samp{+?} in regexp
+is the @samp{+} analog to @samp{*?}.
+
+@item \@{n,m\@}
+@c Note the spacing after the close brace is deliberate.
+@cindex @samp{\@{n,m\@} }in regexp
+this is an interval quantifier, which is analogous to @samp{*} or
+@samp{+}, but specifies that the expression must match at least @samp{n}
+times, but no more than @samp{m} times. This syntax comes to us from
+@cite{ed}, @cite{grep}, and @cite{perl}. The @cite{etags} utility also
+supports it.
+
+@lisp
+@group
+(setq s "12 123 1234 12345")
+ @result{} "12 123 1234 12345"
+(string-match "[0-9]\\@{2,4\\@}" s)
+ @result{} 0
+(match-string 0 s)
+ @result{} "12"
+(string-match "[0-9]\\@{3,4\\@}" s)
+ @result{} 3
+(match-string 0 s)
+ @result{} "123"
+@end group
+@end lisp
+
 @item [ @dots{} ]
 @cindex character set (in regexp)
 @cindex @samp{[} in regexp
@@ -409,6 +479,39 @@
 half, which may be anything, but the @samp{\1} that follows must match
 the same exact text.
 
+@item \(?: @dots{} \)
+@cindex @samp{(?:} in regex
+@cindex regexp grouping
+is called a "shy" grouping operator, and it is used just like @samp{\(
+@dots{} \)}, except that it does not cause the matched substring to be
+recorded for future reference. This can be useful at times when a
+program wants to refer to a specific @samp{\( @dots{} \)} group's number
+(eg. in a @code{match-string} or @code{match-beginning} function
+application) and you need to use grouping constructs for an alternation
+or multi--character repetition inside a regular expression string that
+can change each time the code is run, but you don't want those groups
+counting because they'd change the reference number of the group you
+want to refer to that is inside the static part of your generated
+regular expression.
+
+@lisp
+;; @r{Here `dynamic-regex' might contain shy groups.}
+(re-search-forward
+ (concat "\\(" dynamic-regex "\\)\\(-?[0-9]\\@{2,4\\@}\\)"))
+;; @r{and this `match-string' will still refer to the integer}
+;; @r{captured by the second group in the `concat' string.}
+(match-string 2)
+@end lisp
+
+Using @samp{\(?: @dots{} \)} rather than @samp{\( @dots{} \)} when you
+don't need the captured substrings ought to speed up your programs some,
+since it shortens the code path followed by the regular expression
+engine, as well as the amount of memory allocation and string copying it
+must do. The actual performance gain to be observed has not been
+measured or quantified as of this writing.
+@c This is used to good advantage by the font-locking code, and by `regexp-opt.el'.
+@c ... It will be. It's not yet, but will be.
+
 @item \w
 @cindex @samp{\w} in regexp
 matches any word-constituent character. The editor syntax table