comparison man/lispref/searching.texi @ 314:341dac730539 r21-0b55

Import from CVS: tag r21-0b55
author cvs
date Mon, 13 Aug 2007 10:44:22 +0200
parents 70ad99077275
children 512e409c26a2
313:2905de29931f 314:341dac730539
157 @cindex regexp 157 @cindex regexp
158 158
159 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that 159 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that
160 denotes a (possibly infinite) set of strings. Searching for matches for 160 denotes a (possibly infinite) set of strings. Searching for matches for
161 a regexp is a very powerful operation. This section explains how to write 161 a regexp is a very powerful operation. This section explains how to write
162 regexps; the following section says how to search for them. 162 regexps; the following section says how to search using them.
163 163
164 To gain a thorough understanding of regular expressions and how to use 164 To gain a thorough understanding of regular expressions and how to use
165 them to best advantage, we recommend that you study @cite{Mastering 165 them to best advantage, we recommend that you study @cite{Mastering
166 Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly and Associates, 166 Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly and Associates,
167 1997}. (It's known as the "Hip Owls" book, because of the picture on its 167 1997}. (It's known as the "Hip Owls" book, because of the picture on its
168 cover.) You might also read the manuals to @ref{(gawk)Top}, 168 cover.) You might also read the manuals to @ref{(gawk)Top},
169 @ref{(ed)Top}, @cite{sed}, @cite{grep}, @ref{(perl)Top}, 169 @ref{(ed)Top}, @cite{sed}, @cite{grep}, @ref{(perl)Top},
170 @ref{(regex)Top}, @ref{(rx)Top}, @cite{pcre}, and @ref{(flex)Top}, which 170 @ref{(regex)Top}, @ref{(rx)Top}, @cite{pcre}, and @ref{(flex)Top}. All
171 also make good use of regular expressions. 171 of these programs and libraries make effective use of regular
172 expressions.
172 173
173 The XEmacs regular expression syntax most closely resembles that of 174 The XEmacs regular expression syntax most closely resembles that of
174 @cite{ed}, or @cite{grep}, the GNU versions of which all utilize the GNU 175 @cite{ed}, or @cite{grep}, the GNU versions of which all utilize the GNU
175 @cite{regex} library. XEmacs' version of @cite{regex} has recently been 176 @cite{regex} library. XEmacs' version of @cite{regex} has recently been
176 extended with some Perl--like capabilities, described in the next 177 extended with some Perl--like capabilities, which are described in the
177 section. 178 next section.
178 179
179 @menu 180 @menu
180 * Syntax of Regexps:: Rules for writing regular expressions. 181 * Syntax of Regexps:: Rules for writing regular expressions.
181 * Regexp Example:: Illustrates regular expression syntax. 182 * Regexp Example:: Illustrates regular expression syntax.
182 @end menu 183 @end menu
261 262
262 @item ? 263 @item ?
263 @cindex @samp{?} in regexp 264 @cindex @samp{?} in regexp
264 is a quantifying suffix operator similar to @samp{*}, except that the 265 is a quantifying suffix operator similar to @samp{*}, except that the
265 preceding expression can match either once or not at all. For example, 266 preceding expression can match either once or not at all. For example,
266 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anyhing 267 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anything
267 else. 268 else.
268 269
269 @item *? 270 @item *?
270 @cindex @samp{*?} in regexp 271 @cindex @samp{*?} in regexp
271 works just like @samp{*}, except that rather than matching the longest 272 works just like @samp{*}, except that rather than matching the longest
272 match, it matches the shortest match. @samp{*?} is known as a 273 match, it matches the shortest match. @samp{*?} is known as a
273 @dfn{non-greedy} quantifier, a regexp construct borrowed from Perl. 274 @dfn{non-greedy} quantifier, a regexp construct borrowed from Perl.
274 @c Did perl get this from somewhere? What's the real history of *? ? 275 @c Did perl get this from somewhere? What's the real history of *? ?
275 276
276 This construct very useful for when you want to match the text inside a 277 This construct is very useful for when you want to match the text inside
277 pair of delimiters. For instance, @samp{/\*.*?\*/} will match C 278 a pair of delimiters. For instance, @samp{/\*.*?\*/} will match C
278 comments in a string. This could not be achieved without the use of 279 comments in a string. This could not be so elegantly achieved without
279 greedy quantifier. 280 the use of a nongreedy quantifier.
280 281
281 This construct has not been available prior to XEmacs 20.4. It is not 282 This construct has not been available prior to XEmacs 20.4. It is not
282 available in FSF Emacs. 283 available in FSF Emacs.
283 284
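As a small illustration of the difference between the two quantifiers
(a sketch only: the sample string is made up for this example, and it
assumes an XEmacs recent enough to support @samp{*?}), compare the
greedy and non-greedy forms with @code{string-match}:

@example
@group
(let ((text "a /* one */ b /* two */ c"))
  (list
   ;; Greedy: runs from the first "/*" to the last "*/".
   (progn (string-match "/\\*.*\\*/" text)
          (match-string 0 text))
   ;; Non-greedy: stops at the first "*/" that completes a match.
   (progn (string-match "/\\*.*?\\*/" text)
          (match-string 0 text))))
     @result{} ("/* one */ b /* two */" "/* one */")
@end group
@end example

Note that @samp{.} does not match a newline, so this pattern only finds
comments that fit on a single line.
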
284 @item +? 285 @item +?
453 composed of two identical halves. The @samp{\(.*\)} matches the first 454 composed of two identical halves. The @samp{\(.*\)} matches the first
454 half, which may be anything, but the @samp{\1} that follows must match 455 half, which may be anything, but the @samp{\1} that follows must match
455 the same exact text. 456 the same exact text.
456 457
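The following sketch shows the back-reference at work. The anchors
@samp{\`} and @samp{\'} are added here (they are an assumption of this
example, not part of the pattern discussed above) so that the whole
string must consist of the two halves:

@example
@group
(let ((text "abcabc"))
  ;; Group 1 must match some text, and \1 must repeat it exactly.
  (when (string-match "\\`\\(.*\\)\\1\\'" text)
    (match-string 1 text)))
     @result{} "abc"

;; The two halves differ here, so there is no match.
(string-match "\\`\\(.*\\)\\1\\'" "abcabd")
     @result{} nil
@end group
@end example
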
457 @item \(?: @dots{} \) 458 @item \(?: @dots{} \)
458 @cindex @samp{\(?:} in regexp 459 @cindex @samp{(?:} in regexp
459 @cindex regexp grouping 460 @cindex regexp grouping
460 is called a @dfn{shy} grouping operator, and it is used just like 461 is called a @dfn{shy} grouping operator, and it is used just like
461 @samp{\( @dots{} \)}, except that it does not cause the matched 462 @samp{\( @dots{} \)}, except that it does not cause the match
462 substring to be recorded for future reference. 463 substring to be recorded for future reference.
463 464
464 This is useful when you need a lot of grouping @samp{\( @dots{} \)} 465 This is useful when you need to use a lot of nested grouping @samp{\(
465 constructs, but only want to remember one or two. Then you can use 466 @dots{} \)} constructs to express complex alternation, but only want to
466 not want to remember them for later use with @code{match-string}. 467 memoize, or capture, one or two of the subexpression matches. Since
467 468 @samp{\(?: @dots{} \)} doesn't capture a submatch, it also doesn't need
468 Using @samp{\(?: @dots{} \)} rather than @samp{\( @dots{} \)} when you 469 to be counted when you count @samp{\( @dots{} \)} groups to figure the
469 don't need the captured substrings ought to speed up your programs some, 470 @samp{match-string} index. That turns out to be a very convenient
470 since it shortens the code path followed by the regular expression 471 characteristic.
471 engine, as well as the amount of memory allocation and string copying it 472
472 must do. The actual performance gain to be observed has not been 473 This situation occurs where parts of a regular expression have been
473 measured or quantified as of this writing. 474 automatically generated by a program that builds them from lists of
474 @c This is used to good advantage by the font-locking code, and by 475 strings, and the static code following the matching operation must
475 @c `regexp-opt.el'. ... It will be. It's not yet, but will be. 476 access a specific match number. Here's an example that shows this:
477
478 @example
479 @group
480 ;; Assume that:
481 (require 'regexp-opt) ;; gets executed at toplevel
482 ;;; `regexp-opt.el' is part of the "xemacs-devel" package.
483 ;; ... and that VARNAMES is a list of strings holding the names of some
484 ;; variables extracted from the program source you are editing and
485 ;; running this function on. For this example, it will just be bound
486 ;; in the let* expression.
487 (let* ((varnames '("k" "n" "i" "j" "varname"))
488        (keys-regexp (regexp-opt
489                      (mapcar #'symbol-name
490                              '(if then else elif
491                                case in of do while
492                                with for next unless
493                                cond begin end))))
494        (varname-regexp (regexp-opt varnames))
495        (contrived-regexp (concat "\\(" keys-regexp "\\)"
496                                  "\\s-(\\s-\\("
497                                  varname-regexp
498                                  "\\)\\s-)"))
499        (keyname "")
500        (varname ""))
501   ;; In the body of this particular defun, we:
502   (re-search-forward contrived-regexp nil t)
503   ;; ... and it finds a match. Now we want to extract the text that
504   ;; it matched on, and save it into KEYNAME and VARNAME.
505   (setq keyname (match-string 1)
506         varname (match-string 2))
507   ;; ... and then do something with those values.
508   (list keyname varname))
509
510 ;; Here's something for it to match, so you can try it with `C-x C-e'.
511 ;; while ( j ) do ...
512 @end group
513 @end example
514
515 Here you can see that if the regular expression returned by
516 @samp{regexp-opt} did not use @samp{\(?: @dots{} \)} for grouping, and
517 instead used @samp{\( @dots{} \)}, it would be necessary to count the
518 number of opening parentheses in the @samp{keys-regexp} and to use that
519 figure to calculate which match number is matched by the
520 @code{varname-regexp}. It is much more convenient to be able to just
521 ask for the second match string.
522
523 @c This is used to good advantage by the font-locking code....
524 @c ... It will be. It's not yet, but will be.
476 525
477 The shy grouping operator has been borrowed from Perl, and has not been 526 The shy grouping operator has been borrowed from Perl, and has not been
478 available prior to XEmacs 20.3, nor is it available in FSF Emacs. 527 available prior to XEmacs 20.3, nor is it available in FSF Emacs.
479 528
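The effect on match numbering can be seen in a much smaller case. This
is a minimal sketch that assumes a hand-written alternation standing in
for generated keyword output, and a made-up sample string. Because the
keyword alternation uses a shy group, the variable name is submatch 1:

@example
@group
(let ((text "while ( j ) do"))
  ;; \(?: ... \) groups the alternation without capturing it,
  ;; so the only numbered group is the one around the variable name.
  (when (string-match
         "\\(?:if\\|while\\)\\s-(\\s-\\([a-z]+\\)\\s-)"
         text)
    (match-string 1 text)))
     @result{} "j"
@end group
@end example

Had the alternation been written with @samp{\( @dots{} \)} instead, the
variable name would have been submatch 2.
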
480 @item \w 529 @item \w