Mercurial > hg > xemacs-beta
comparison man/lispref/searching.texi @ 255:084402c475ba r20-5b26
Import from CVS: tag r20-5b26
author | cvs |
---|---|
date | Mon, 13 Aug 2007 10:21:18 +0200 |
parents | 376386a54a3c |
children | 7df0dd720c89 |
254:e92abcaa252b | 255:084402c475ba |
---|---|
159 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that | 159 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that |
160 denotes a (possibly infinite) set of strings. Searching for matches for | 160 denotes a (possibly infinite) set of strings. Searching for matches for |
161 a regexp is a very powerful operation. This section explains how to write | 161 a regexp is a very powerful operation. This section explains how to write |
162 regexps; the following section says how to search for them. | 162 regexps; the following section says how to search for them. |
163 | 163 |
164 To gain a thorough understanding of regular expressions and how to use | |
165 them to best advantage, we recommend that you study @cite{Mastering | |
166 Regular Expressions}, by Jeffrey E.F. Friedl, O'Reilly and Associates, | |
167 1997. (It's known as the "Hip Owls" book, because of the picture on its | |
168 cover.) You might also read the manuals to @ref{(gawk)Top}, | |
169 @ref{(ed)Top}, @cite{sed}, @cite{grep}, @ref{(perl)Top}, | |
170 @ref{(regex)Top}, @ref{(rx)Top}, @cite{pcre}, and @ref{(flex)Top}, which | |
171 also make good use of regular expressions. | |
172 | |
173 The XEmacs regular expression syntax most closely resembles that of | |
174 @cite{ed} or @cite{grep}, the GNU versions of which both use the GNU | |
175 @cite{regex} library. XEmacs' version of @cite{regex} has recently been | |
176 extended with some perl-like capabilities, described in the next | |
177 section. | |
178 | |
164 @menu | 179 @menu |
165 * Syntax of Regexps:: Rules for writing regular expressions. | 180 * Syntax of Regexps:: Rules for writing regular expressions. |
166 * Regexp Example:: Illustrates regular expression syntax. | 181 * Regexp Example:: Illustrates regular expression syntax. |
167 @end menu | 182 @end menu |
168 | 183 |
203 matches any three-character string that begins with @samp{a} and ends with | 218 matches any three-character string that begins with @samp{a} and ends with |
204 @samp{b}.@refill | 219 @samp{b}.@refill |
205 | 220 |
206 @item * | 221 @item * |
207 @cindex @samp{*} in regexp | 222 @cindex @samp{*} in regexp |
208 is not a construct by itself; it is a suffix operator that means to | 223 is not a construct by itself; it is a quantifying suffix operator that |
209 repeat the preceding regular expression as many times as possible. In | 224 means to repeat the preceding regular expression as many times as |
210 @samp{fo*}, the @samp{*} applies to the @samp{o}, so @samp{fo*} matches | 225 possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so |
211 one @samp{f} followed by any number of @samp{o}s. The case of zero | 226 @samp{fo*} matches one @samp{f} followed by any number of @samp{o}s. |
212 @samp{o}s is allowed: @samp{fo*} does match @samp{f}.@refill | 227 The case of zero @samp{o}s is allowed: @samp{fo*} does match |
228 @samp{f}.@refill | |
213 | 229 |
214 @samp{*} always applies to the @emph{smallest} possible preceding | 230 @samp{*} always applies to the @emph{smallest} possible preceding |
215 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a | 231 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a |
216 repeating @samp{fo}.@refill | 232 repeating @samp{fo}.@refill |
217 | 233 |
218 The matcher processes a @samp{*} construct by matching, immediately, | 234 The matcher processes a @samp{*} construct by matching, immediately, as |
219 as many repetitions as can be found. Then it continues with the rest | 235 many repetitions as can be found; it is "greedy". Then it continues |
220 of the pattern. If that fails, backtracking occurs, discarding some | 236 with the rest of the pattern. If that fails, backtracking occurs, |
221 of the matches of the @samp{*}-modified construct in case that makes | 237 discarding some of the matches of the @samp{*}-modified construct in |
222 it possible to match the rest of the pattern. For example, in matching | 238 case that makes it possible to match the rest of the pattern. For |
223 @samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first | 239 example, in matching @samp{ca*ar} against the string @samp{caaar}, the |
224 tries to match all three @samp{a}s; but the rest of the pattern is | 240 @samp{a*} first tries to match all three @samp{a}s; but the rest of the |
225 @samp{ar} and there is only @samp{r} left to match, so this try fails. | 241 pattern is @samp{ar} and there is only @samp{r} left to match, so this |
226 The next alternative is for @samp{a*} to match only two @samp{a}s. | 242 try fails. The next alternative is for @samp{a*} to match only two |
227 With this choice, the rest of the regexp matches successfully.@refill | 243 @samp{a}s. With this choice, the rest of the regexp matches |
244 successfully.@refill | |
228 | 245 |
229 Nested repetition operators can be extremely slow if they specify | 246 Nested repetition operators can be extremely slow if they specify |
230 backtracking loops. For example, it could take hours for the regular | 247 backtracking loops. For example, it could take hours for the regular |
231 expression @samp{\(x+y*\)*a} to match the sequence | 248 expression @samp{\(x+y*\)*a} to match the sequence |
232 @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}. The slowness is because | 249 @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}. The slowness is because |
234 concluding that none of them can work. To make sure your regular | 251 concluding that none of them can work. To make sure your regular |
235 expressions run fast, check nested repetitions carefully. | 252 expressions run fast, check nested repetitions carefully. |
236 | 253 |
237 @item + | 254 @item + |
238 @cindex @samp{+} in regexp | 255 @cindex @samp{+} in regexp |
239 is a suffix operator similar to @samp{*} except that the preceding | 256 is a quantifying suffix operator similar to @samp{*} except that the |
240 expression must match at least once. So, for example, @samp{ca+r} | 257 preceding expression must match at least once. It is also "greedy". |
241 matches the strings @samp{car} and @samp{caaaar} but not the string | 258 So, for example, @samp{ca+r} matches the strings @samp{car} and |
242 @samp{cr}, whereas @samp{ca*r} matches all three strings. | 259 @samp{caaaar} but not the string @samp{cr}, whereas @samp{ca*r} matches |
260 all three strings. | |
243 | 261 |
244 @item ? | 262 @item ? |
245 @cindex @samp{?} in regexp | 263 @cindex @samp{?} in regexp |
246 is a suffix operator similar to @samp{*} except that the preceding | 264 is a quantifying suffix operator similar to @samp{*}, except that the |
247 expression can match either once or not at all. For example, | 265 preceding expression can match either once or not at all. For example, |
248 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anything | 266 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anything |
249 else. | 267 else. |
268 | |
269 @item *? | |
270 @cindex @samp{*?} in regexp | |
271 works just like @samp{*}, except that rather than matching the longest | |
272 match, it matches the shortest match. This is known as a "non-greedy" | |
273 quantifier. It is a syntax that comes to us from perl. It is very | |
274 useful for situations where you want to match the text inside a pair of | |
275 delimiters. | |
276 @c Did perl get this from somewhere? What's the real history of *? ? | |
277 | |
278 @lisp | |
279 @group | |
280 (setq s "/ blah / / blah2 /") | |
281 @result{} "/ blah / / blah2 /" | |
282 (string-match "/.*/" s) | |
283 @result{} 0 | |
284 (match-string 0 s) | |
285 @result{} "/ blah / / blah2 /" | |
286 (string-match "/.*?/" s) | |
287 @result{} 0 | |
288 (match-string 0 s) | |
289 @result{} "/ blah /" | |
290 @end group | |
291 @end lisp | |
292 | |
293 @item +? | |
294 @cindex @samp{+?} in regexp | |
295 is the @samp{+} analog to @samp{*?}. | |
296 | |
297 @item \@{n,m\@} | |
298 @c Note the spacing after the close brace is deliberate. | |
299 @cindex @samp{\@{n,m\@} }in regexp | |
300 this is an interval quantifier, which is analogous to @samp{*} or | |
301 @samp{+}, but specifies that the expression must match at least @samp{n} | |
302 times, but no more than @samp{m} times. This syntax comes to us from | |
303 @cite{ed}, @cite{grep}, and @cite{perl}. The @cite{etags} utility also | |
304 supports it. | |
305 | |
306 @lisp | |
307 @group | |
308 (setq s "12 123 1234 12345") | |
309 @result{} "12 123 1234 12345" | |
310 (string-match "[0-9]\\@{2,4\\@}" s) | |
311 @result{} 0 | |
312 (match-string 0 s) | |
313 @result{} "12" | |
314 (string-match "[0-9]\\@{3,4\\@}" s) | |
315 @result{} 3 | |
316 (match-string 0 s) | |
317 @result{} "123" | |
318 @end group | |
319 @end lisp | |
250 | 320 |
251 @item [ @dots{} ] | 321 @item [ @dots{} ] |
252 @cindex character set (in regexp) | 322 @cindex character set (in regexp) |
253 @cindex @samp{[} in regexp | 323 @cindex @samp{[} in regexp |
254 @cindex @samp{]} in regexp | 324 @cindex @samp{]} in regexp |
406 | 476 |
407 For example, @samp{\(.*\)\1} matches any newline-free string that is | 477 For example, @samp{\(.*\)\1} matches any newline-free string that is |
408 composed of two identical halves. The @samp{\(.*\)} matches the first | 478 composed of two identical halves. The @samp{\(.*\)} matches the first |
409 half, which may be anything, but the @samp{\1} that follows must match | 479 half, which may be anything, but the @samp{\1} that follows must match |
410 the same exact text. | 480 the same exact text. |
481 | |
482 @item \(?: @dots{} \) | |
483 @cindex @samp{\(?:} in regexp | |
484 @cindex regexp grouping | |
485 is called a "shy" grouping operator, and it is used just like @samp{\( | |
486 @dots{} \)}, except that it does not cause the matched substring to be | |
487 recorded for future reference. This is useful when a program refers to | |
488 a specific @samp{\( @dots{} \)} group by number (e.g., in a call to | |
489 @code{match-string} or @code{match-beginning}) and part of the regular | |
490 expression is generated each time the code is run. If the generated | |
491 part needs grouping constructs for an alternation or multi-character | |
492 repetition, ordinary groups there would change the reference number of | |
493 the group you want to refer to in the static part of the expression. | |
494 Shy groups are not counted, so the number of that group stays the same | |
495 no matter what the generated part contains. | |
496 | |
497 @lisp | |
498 ;; @r{Here `dynamic-regex' might contain shy groups.} | |
499 (re-search-forward | |
500 (concat "\\(" dynamic-regex "\\)\\(-?[0-9]\\@{2,4\\@}\\)")) | |
501 ;; @r{and this `match-string' will still refer to the integer} | |
502 ;; @r{captured by the second group in the `concat' string.} | |
503 (match-string 2) | |
504 @end lisp | |
505 | |
506 Using @samp{\(?: @dots{} \)} rather than @samp{\( @dots{} \)} when you | |
507 don't need the captured substrings ought to speed up your programs some, | |
508 since it shortens the code path followed by the regular expression | |
509 engine, as well as the amount of memory allocation and string copying it | |
510 must do. The actual performance gain to be observed has not been | |
511 measured or quantified as of this writing. | |
512 @c This is used to good advantage by the font-locking code, and by `regexp-opt.el'. | |
513 @c ... It will be. It's not yet, but will be. | |
411 | 514 |
412 @item \w | 515 @item \w |
413 @cindex @samp{\w} in regexp | 516 @cindex @samp{\w} in regexp |
414 matches any word-constituent character. The editor syntax table | 517 matches any word-constituent character. The editor syntax table |
415 determines which characters these are. @xref{Syntax Tables}. | 518 determines which characters these are. @xref{Syntax Tables}. |