comparison man/lispref/searching.texi @ 255:084402c475ba r20-5b26

Import from CVS: tag r20-5b26
author cvs
date Mon, 13 Aug 2007 10:21:18 +0200
parents 376386a54a3c
children 7df0dd720c89
254:e92abcaa252b 255:084402c475ba
159 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that 159 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that
160 denotes a (possibly infinite) set of strings. Searching for matches for 160 denotes a (possibly infinite) set of strings. Searching for matches for
161 a regexp is a very powerful operation. This section explains how to write 161 a regexp is a very powerful operation. This section explains how to write
162 regexps; the following section says how to search for them. 162 regexps; the following section says how to search for them.
163 163
164 To gain a thorough understanding of regular expressions and how to use
165 them to best advantage, we recommend that you study @cite{Mastering
166 Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly and Associates,
167 1997}. (It's known as the "Hip Owls" book, because of the picture on its
168 cover.) You might also read the manuals to @ref{(gawk)Top},
169 @ref{(ed)Top}, @cite{sed}, @cite{grep}, @ref{(perl)Top},
170 @ref{(regex)Top}, @ref{(rx)Top}, @cite{pcre}, and @ref{(flex)Top}, which
171 also make good use of regular expressions.
172
173 The XEmacs regular expression syntax most closely resembles that of
174 @cite{ed} or @cite{grep}, the GNU versions of which both use the GNU
175 @cite{regex} library. XEmacs' version of @cite{regex} has recently been
176 extended with some Perl-like capabilities, described in the next
177 section.
178
164 @menu 179 @menu
165 * Syntax of Regexps:: Rules for writing regular expressions. 180 * Syntax of Regexps:: Rules for writing regular expressions.
166 * Regexp Example:: Illustrates regular expression syntax. 181 * Regexp Example:: Illustrates regular expression syntax.
167 @end menu 182 @end menu
168 183
203 matches any three-character string that begins with @samp{a} and ends with 218 matches any three-character string that begins with @samp{a} and ends with
204 @samp{b}.@refill 219 @samp{b}.@refill
205 220
206 @item * 221 @item *
207 @cindex @samp{*} in regexp 222 @cindex @samp{*} in regexp
208 is not a construct by itself; it is a suffix operator that means to 223 is not a construct by itself; it is a quantifying suffix operator that
209 repeat the preceding regular expression as many times as possible. In 224 means to repeat the preceding regular expression as many times as
210 @samp{fo*}, the @samp{*} applies to the @samp{o}, so @samp{fo*} matches 225 possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so
211 one @samp{f} followed by any number of @samp{o}s. The case of zero 226 @samp{fo*} matches one @samp{f} followed by any number of @samp{o}s.
212 @samp{o}s is allowed: @samp{fo*} does match @samp{f}.@refill 227 The case of zero @samp{o}s is allowed: @samp{fo*} does match
228 @samp{f}.@refill
213 229
214 @samp{*} always applies to the @emph{smallest} possible preceding 230 @samp{*} always applies to the @emph{smallest} possible preceding
215 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a 231 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a
216 repeating @samp{fo}.@refill 232 repeating @samp{fo}.@refill
217 233
218 The matcher processes a @samp{*} construct by matching, immediately, 234 The matcher processes a @samp{*} construct by matching, immediately, as
219 as many repetitions as can be found. Then it continues with the rest 235 many repetitions as can be found; it is "greedy". Then it continues
220 of the pattern. If that fails, backtracking occurs, discarding some 236 with the rest of the pattern. If that fails, backtracking occurs,
221 of the matches of the @samp{*}-modified construct in case that makes 237 discarding some of the matches of the @samp{*}-modified construct in
222 it possible to match the rest of the pattern. For example, in matching 238 case that makes it possible to match the rest of the pattern. For
223 @samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first 239 example, in matching @samp{ca*ar} against the string @samp{caaar}, the
224 tries to match all three @samp{a}s; but the rest of the pattern is 240 @samp{a*} first tries to match all three @samp{a}s; but the rest of the
225 @samp{ar} and there is only @samp{r} left to match, so this try fails. 241 pattern is @samp{ar} and there is only @samp{r} left to match, so this
226 The next alternative is for @samp{a*} to match only two @samp{a}s. 242 try fails. The next alternative is for @samp{a*} to match only two
227 With this choice, the rest of the regexp matches successfully.@refill 243 @samp{a}s. With this choice, the rest of the regexp matches
244 successfully.@refill
228 245
229 Nested repetition operators can be extremely slow if they specify 246 Nested repetition operators can be extremely slow if they specify
230 backtracking loops. For example, it could take hours for the regular 247 backtracking loops. For example, it could take hours for the regular
231 expression @samp{\(x+y*\)*a} to match the sequence 248 expression @samp{\(x+y*\)*a} to match the sequence
232 @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}. The slowness is because 249 @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}. The slowness is because
234 concluding that none of them can work. To make sure your regular 251 concluding that none of them can work. To make sure your regular
235 expressions run fast, check nested repetitions carefully. 252 expressions run fast, check nested repetitions carefully.
236 253
237 @item + 254 @item +
238 @cindex @samp{+} in regexp 255 @cindex @samp{+} in regexp
239 is a suffix operator similar to @samp{*} except that the preceding 256 is a quantifying suffix operator similar to @samp{*} except that the
240 expression must match at least once. So, for example, @samp{ca+r} 257 preceding expression must match at least once. It is also "greedy".
241 matches the strings @samp{car} and @samp{caaaar} but not the string 258 So, for example, @samp{ca+r} matches the strings @samp{car} and
242 @samp{cr}, whereas @samp{ca*r} matches all three strings. 259 @samp{caaaar} but not the string @samp{cr}, whereas @samp{ca*r} matches
260 all three strings.
243 261
244 @item ? 262 @item ?
245 @cindex @samp{?} in regexp 263 @cindex @samp{?} in regexp
246 is a suffix operator similar to @samp{*} except that the preceding 264 is a quantifying suffix operator similar to @samp{*}, except that the
247 expression can match either once or not at all. For example, 265 preceding expression can match either once or not at all. For example,
248 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anything 266 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anything
249 else. 267 else.
268
269 @item *?
270 @cindex @samp{*?} in regexp
271 works just like @samp{*}, except that it matches as few repetitions as
272 possible rather than as many. This is known as a "non-greedy"
273 quantifier. The syntax comes to us from Perl. It is very
274 useful for situations where you want to match the text inside a pair of
275 delimiters.
276 @c Did perl get this from somewhere? What's the real history of *? ?
277
278 @lisp
279 @group
280 (setq s "/ blah / / blah2 /")
281 @result{} "/ blah / / blah2 /"
282 (string-match "/.*/" s)
283 @result{} 0
284 (match-string 0 s)
285 @result{} "/ blah / / blah2 /"
286 (string-match "/.*?/" s)
287 @result{} 0
288 (match-string 0 s)
289 @result{} "/ blah /"
290 @end group
291 @end lisp
292
293 @item +?
294 @cindex @samp{+?} in regexp
295 is the non-greedy counterpart of @samp{+}, just as @samp{*?} is of @samp{*}.
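
A minimal sketch of the difference, assuming a build in which the
non-greedy quantifiers are available (the string @code{"aaa"} is just an
illustration): @samp{a+} takes every @samp{a} it can, while @samp{a+?}
stops after the single repetition it is required to match.

@lisp
@group
(setq s "aaa")
     @result{} "aaa"
(string-match "a+" s)
     @result{} 0
(match-string 0 s)
     @result{} "aaa"
(string-match "a+?" s)
     @result{} 0
(match-string 0 s)
     @result{} "a"
@end group
@end lisp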
296
297 @item \@{n,m\@}
298 @c Note the spacing after the close brace is deliberate.
299 @cindex @samp{\@{n,m\@} }in regexp
300 is an interval quantifier, which is analogous to @samp{*} or
301 @samp{+}, but specifies that the expression must match at least @samp{n}
302 times, but no more than @samp{m} times. This syntax comes to us from
303 @cite{ed}, @cite{grep}, and @cite{perl}. The @cite{etags} utility also
304 supports it.
305
306 @lisp
307 @group
308 (setq s "12 123 1234 12345")
309 @result{} "12 123 1234 12345"
310 (string-match "[0-9]\\@{2,4\\@}" s)
311 @result{} 0
312 (match-string 0 s)
313 @result{} "12"
314 (string-match "[0-9]\\@{3,4\\@}" s)
315 @result{} 3
316 (match-string 0 s)
317 @result{} "123"
318 @end group
319 @end lisp
250 320
251 @item [ @dots{} ] 321 @item [ @dots{} ]
252 @cindex character set (in regexp) 322 @cindex character set (in regexp)
253 @cindex @samp{[} in regexp 323 @cindex @samp{[} in regexp
254 @cindex @samp{]} in regexp 324 @cindex @samp{]} in regexp
406 476
407 For example, @samp{\(.*\)\1} matches any newline-free string that is 477 For example, @samp{\(.*\)\1} matches any newline-free string that is
408 composed of two identical halves. The @samp{\(.*\)} matches the first 478 composed of two identical halves. The @samp{\(.*\)} matches the first
409 half, which may be anything, but the @samp{\1} that follows must match 479 half, which may be anything, but the @samp{\1} that follows must match
410 the same exact text. 480 the same exact text.
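
As an illustrative sketch, the pattern can be anchored with @samp{^} and
@samp{$} (an addition made here, so that the whole string must consist of
the two halves) and tried on two sample strings:

@lisp
@group
(string-match "^\\(.*\\)\\1$" "abcabc")
     @result{} 0
(match-string 1 "abcabc")
     @result{} "abc"
(string-match "^\\(.*\\)\\1$" "abcabd")
     @result{} nil
@end group
@end lisp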
481
482 @item \(?: @dots{} \)
483 @cindex @samp{\(?:} in regexp
484 @cindex regexp grouping
485 is called a "shy" grouping operator. It is used just like @samp{\(
486 @dots{} \)}, except that it does not cause the matched substring to be
487 recorded for future reference. This is useful when a program refers to
488 a specific @samp{\( @dots{} \)} group by number (e.g., in a call to
489 @code{match-string} or @code{match-beginning}) and the regular
490 expression is assembled at run time. The dynamic part may need grouping
491 for an alternation or a multi-character repetition; if those groups were
492 ordinary ones, they would be counted, and the number of the group you
493 want in the static part of the generated regular expression would
494 change from run to run. Shy groups are not counted, so the number
495 stays fixed.
496
497 @lisp
498 ;; @r{Here `dynamic-regex' might contain shy groups.}
499 (re-search-forward
500 (concat "\\(" dynamic-regex "\\)\\(-?[0-9]\\@{2,4\\@}\\)"))
501 ;; @r{and this `match-string' will still refer to the integer}
502 ;; @r{captured by the second group in the `concat' string.}
503 (match-string 2)
504 @end lisp
505
506 Using @samp{\(?: @dots{} \)} rather than @samp{\( @dots{} \)} when you
507 don't need the captured substrings ought to speed up your programs some,
508 since it shortens the code path followed by the regular expression
509 engine, as well as the amount of memory allocation and string copying it
510 must do. The actual performance gain to be observed has not been
511 measured or quantified as of this writing.
512 @c This is used to good advantage by the font-locking code, and by `regexp-opt.el'.
513 @c ... It will be. It's not yet, but will be.
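
The effect on group numbering can be sketched as follows, assuming a
build that supports shy groups; the sample string and patterns are
invented for illustration. With an ordinary group around the
alternation, the digits land in group 2; with a shy group, they are in
group 1:

@lisp
@group
(setq s "foobar-42")
     @result{} "foobar-42"
(string-match "\\(foo\\|bar\\)+-\\([0-9]+\\)" s)
     @result{} 0
(match-string 2 s)
     @result{} "42"
(string-match "\\(?:foo\\|bar\\)+-\\([0-9]+\\)" s)
     @result{} 0
(match-string 1 s)
     @result{} "42"
@end group
@end lisp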
411 514
412 @item \w 515 @item \w
413 @cindex @samp{\w} in regexp 516 @cindex @samp{\w} in regexp
414 matches any word-constituent character. The editor syntax table 517 matches any word-constituent character. The editor syntax table
415 determines which characters these are. @xref{Syntax Tables}. 518 determines which characters these are. @xref{Syntax Tables}.