comparison man/xemacs/search.texi @ 442:abe6d1db359e r21-2-36
Import from CVS: tag r21-2-36
author | cvs |
---|---|
date | Mon, 13 Aug 2007 11:35:02 +0200 |
parents | 376386a54a3c |
children |
441:72a7cfa4a488 | 442:abe6d1db359e |
---|---|
75 only if the next command you want to type is a printing character, | 75 only if the next command you want to type is a printing character, |
76 @key{DEL}, @key{ESC}, or another control character that is special | 76 @key{DEL}, @key{ESC}, or another control character that is special |
77 within searches (@kbd{C-q}, @kbd{C-w}, @kbd{C-r}, @kbd{C-s}, or @kbd{C-y}). | 77 within searches (@kbd{C-q}, @kbd{C-w}, @kbd{C-r}, @kbd{C-s}, or @kbd{C-y}). |
78 | 78 |
79 Sometimes you search for @samp{FOO} and find it, but were actually | 79 Sometimes you search for @samp{FOO} and find it, but were actually |
80 looking for a different occurance of it. To move to the next occurrence | 80 looking for a different occurrence of it. To move to the next occurrence |
81 of the search string, type another @kbd{C-s}. Do this as often as | 81 of the search string, type another @kbd{C-s}. Do this as often as |
82 necessary. If you overshoot, you can cancel some @kbd{C-s} | 82 necessary. If you overshoot, you can cancel some @kbd{C-s} |
83 characters with @key{DEL}. | 83 characters with @key{DEL}. |
84 | 84 |
85 After you exit a search, you can search for the same string again by | 85 After you exit a search, you can search for the same string again by |
328 @section Regular Expression Search | 328 @section Regular Expression Search |
329 @cindex regular expression | 329 @cindex regular expression |
330 @cindex regexp | 330 @cindex regexp |
331 | 331 |
332 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that | 332 A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that |
333 denotes a set of strings, possibly an infinite set. Searching for matches | 333 denotes a (possibly infinite) set of strings. Searching for matches |
334 for a regexp is a powerful operation that editors on Unix systems have | 334 for a regexp is a powerful operation that editors on Unix systems have |
335 traditionally offered. In XEmacs, you can search for the next match for | 335 traditionally offered. |
336 a regexp either incrementally or not. | 336 |
337 To gain a thorough understanding of regular expressions and how to use | |
338 them to best advantage, we recommend that you study @cite{Mastering | |
339 Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly and Associates, | |
340 1997}. (It's known as the "Hip Owls" book, because of the picture on its | |
341 cover.) You might also read the manuals to @ref{(gawk)Top}, | |
342 @ref{(ed)Top}, @cite{sed}, @cite{grep}, @ref{(perl)Top}, | |
343 @ref{(regex)Top}, @ref{(rx)Top}, @cite{pcre}, and @ref{(flex)Top}, which | |
344 also make good use of regular expressions. | |
345 | |
346 The XEmacs regular expression syntax most closely resembles that of | |
347 @cite{ed}, or @cite{grep}, the GNU versions of which all utilize the GNU | |
348 @cite{regex} library. XEmacs' version of @cite{regex} has recently been | |
349 extended with some Perl--like capabilities, described in the next | |
350 section. | |
351 | |
352 In XEmacs, you can search for the next match for a regexp either | |
353 incrementally or not. | |
337 | 354 |
338 @kindex M-C-s | 355 @kindex M-C-s |
356 @kindex M-C-r | |
339 @findex isearch-forward-regexp | 357 @findex isearch-forward-regexp |
340 @findex isearch-backward-regexp | 358 @findex isearch-backward-regexp |
341 Incremental search for a regexp is done by typing @kbd{M-C-s} | 359 Incremental search for a regexp is done by typing @kbd{M-C-s} |
342 (@code{isearch-forward-regexp}). This command reads a search string | 360 (@code{isearch-forward-regexp}). This command reads a search string |
343 incrementally just like @kbd{C-s}, but it treats the search string as a | 361 incrementally just like @kbd{C-s}, but it treats the search string as a |
344 regexp rather than looking for an exact match against the text in the | 362 regexp rather than looking for an exact match against the text in the |
345 buffer. Each time you add text to the search string, you make the regexp | 363 buffer. Each time you add text to the search string, you make the regexp |
346 longer, and the new regexp is searched for. A reverse regexp search command | 364 longer, and the new regexp is searched for. A reverse regexp search command |
347 @code{isearch-backward-regexp} also exists, but no key runs it. | 365 @code{isearch-backward-regexp} also exists, bound to @kbd{M-C-r}. |
348 | 366 |
349 All of the control characters that do special things within an ordinary | 367 All of the control characters that do special things within an ordinary |
350 incremental search have the same functionality in incremental regexp search. | 368 incremental search have the same functionality in incremental regexp search. |
351 Typing @kbd{C-s} or @kbd{C-r} immediately after starting a search | 369 Typing @kbd{C-s} or @kbd{C-r} immediately after starting a search |
352 retrieves the last incremental search regexp used: | 370 retrieves the last incremental search regexp used: |
356 @findex re-search-backward | 374 @findex re-search-backward |
357 Non-incremental search for a regexp is done by the functions | 375 Non-incremental search for a regexp is done by the functions |
358 @code{re-search-forward} and @code{re-search-backward}. You can invoke | 376 @code{re-search-forward} and @code{re-search-backward}. You can invoke |
359 them with @kbd{M-x} or bind them to keys. You can also call | 377 them with @kbd{M-x} or bind them to keys. You can also call |
360 @code{re-search-forward} by way of incremental regexp search with | 378 @code{re-search-forward} by way of incremental regexp search with |
361 @kbd{M-C-s @key{RET}}. | 379 @kbd{M-C-s @key{RET}}; similarly for @code{re-search-backward} with |
380 @kbd{M-C-r @key{RET}}. | |
362 | 381 |
363 @node Regexps, Search Case, Regexp Search, Search | 382 @node Regexps, Search Case, Regexp Search, Search |
364 @section Syntax of Regular Expressions | 383 @section Syntax of Regular Expressions |
365 | 384 |
366 Regular expressions have a syntax in which a few characters are special | 385 Regular expressions have a syntax in which a few characters are |
367 constructs and the rest are @dfn{ordinary}. An ordinary character is a | 386 special constructs and the rest are @dfn{ordinary}. An ordinary |
368 simple regular expression which matches that character and nothing else. | 387 character is a simple regular expression that matches that character and |
369 The special characters are @samp{$}, @samp{^}, @samp{.}, @samp{*}, | 388 nothing else. The special characters are @samp{.}, @samp{*}, @samp{+}, |
370 @samp{+}, @samp{?}, @samp{[}, @samp{]} and @samp{\}; no new special | 389 @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new |
371 characters will be defined. Any other character appearing in a regular | 390 special characters will be defined in the future. Any other character |
372 expression is ordinary, unless a @samp{\} precedes it.@refill | 391 appearing in a regular expression is ordinary, unless a @samp{\} |
392 precedes it. | |
373 | 393 |
374 For example, @samp{f} is not a special character, so it is ordinary, and | 394 For example, @samp{f} is not a special character, so it is ordinary, and |
375 therefore @samp{f} is a regular expression that matches the string @samp{f} | 395 therefore @samp{f} is a regular expression that matches the string |
376 and no other string. (It does @i{not} match the string @samp{ff}.) Likewise, | 396 @samp{f} and no other string. (It does @emph{not} match the string |
377 @samp{o} is a regular expression that matches only @samp{o}.@refill | 397 @samp{ff}.) Likewise, @samp{o} is a regular expression that matches |
398 only @samp{o}.@refill | |
378 | 399 |
379 Any two regular expressions @var{a} and @var{b} can be concatenated. The | 400 Any two regular expressions @var{a} and @var{b} can be concatenated. The |
380 result is a regular expression which matches a string if @var{a} matches | 401 result is a regular expression that matches a string if @var{a} matches |
381 some amount of the beginning of that string and @var{b} matches the rest of | 402 some amount of the beginning of that string and @var{b} matches the rest of |
382 the string.@refill | 403 the string.@refill |
383 | 404 |
384 As a simple example, you can concatenate the regular expressions @samp{f} | 405 As a simple example, we can concatenate the regular expressions @samp{f} |
385 and @samp{o} to get the regular expression @samp{fo}, which matches only | 406 and @samp{o} to get the regular expression @samp{fo}, which matches only |
386 the string @samp{fo}. To do something nontrivial, you | 407 the string @samp{fo}. Still trivial. To do something more powerful, you |
387 need to use one of the following special characters: | 408 need to use one of the special characters. Here is a list of them: |
388 | 409 |
410 @need 1200 | |
389 @table @kbd | 411 @table @kbd |
390 @item .@: @r{(Period)} | 412 @item .@: @r{(Period)} |
413 @cindex @samp{.} in regexp | |
391 is a special character that matches any single character except a newline. | 414 is a special character that matches any single character except a newline. |
392 Using concatenation, you can make regular expressions like @samp{a.b}, which | 415 Using concatenation, we can make regular expressions like @samp{a.b}, which |
393 matches any three-character string which begins with @samp{a} and ends with | 416 matches any three-character string that begins with @samp{a} and ends with |
394 @samp{b}.@refill | 417 @samp{b}.@refill |
395 | 418 |
396 @item * | 419 @item * |
397 is not a construct by itself; it is a suffix, which means the | 420 @cindex @samp{*} in regexp |
398 preceding regular expression is to be repeated as many times as | 421 is not a construct by itself; it is a quantifying suffix operator that |
422 means to repeat the preceding regular expression as many times as | |
399 possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so | 423 possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so |
400 @samp{fo*} matches one @samp{f} followed by any number of @samp{o}s. | 424 @samp{fo*} matches one @samp{f} followed by any number of @samp{o}s. |
401 The case of zero @samp{o}s is allowed: @samp{fo*} does match | 425 The case of zero @samp{o}s is allowed: @samp{fo*} does match |
402 @samp{f}.@refill | 426 @samp{f}.@refill |
403 | 427 |
404 @samp{*} always applies to the @i{smallest} possible preceding | 428 @samp{*} always applies to the @emph{smallest} possible preceding |
405 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a | 429 expression. Thus, @samp{fo*} has a repeating @samp{o}, not a |
406 repeating @samp{fo}.@refill | 430 repeating @samp{fo}.@refill |
407 | 431 |
408 The matcher processes a @samp{*} construct by immediately matching | 432 The matcher processes a @samp{*} construct by matching, immediately, as |
409 as many repetitions as it can find. Then it continues with the rest | 433 many repetitions as can be found; it is "greedy". Then it continues |
410 of the pattern. If that fails, backtracking occurs, discarding some | 434 with the rest of the pattern. If that fails, backtracking occurs, |
411 of the matches of the @samp{*}-modified construct in case that makes | 435 discarding some of the matches of the @samp{*}-modified construct in |
412 it possible to match the rest of the pattern. For example, matching | 436 case that makes it possible to match the rest of the pattern. For |
413 @samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first | 437 example, in matching @samp{ca*ar} against the string @samp{caaar}, the |
414 tries to match all three @samp{a}s; but the rest of the pattern is | 438 @samp{a*} first tries to match all three @samp{a}s; but the rest of the |
415 @samp{ar} and there is only @samp{r} left to match, so this try fails. | 439 pattern is @samp{ar} and there is only @samp{r} left to match, so this |
416 The next alternative is for @samp{a*} to match only two @samp{a}s. | 440 try fails. The next alternative is for @samp{a*} to match only two |
417 With this choice, the rest of the regexp matches successfully.@refill | 441 @samp{a}s. With this choice, the rest of the regexp matches |
442 successfully.@refill | |
443 | |
444 Nested repetition operators can be extremely slow if they specify | |
445 backtracking loops. For example, it could take hours for the regular | |
446 expression @samp{\(x+y*\)*a} to match the sequence | |
447 @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}. The slowness is because | |
448 Emacs must try each imaginable way of grouping the 35 @samp{x}'s before | |
449 concluding that none of them can work. To make sure your regular | |
450 expressions run fast, check nested repetitions carefully. | |
418 | 451 |
419 @item + | 452 @item + |
420 is a suffix character similar to @samp{*} except that it requires that | 453 @cindex @samp{+} in regexp |
421 the preceding expression be matched at least once. For example, | 454 is a quantifying suffix operator similar to @samp{*} except that the |
422 @samp{ca+r} will match the strings @samp{car} and @samp{caaaar} | 455 preceding expression must match at least once. It is also "greedy". |
423 but not the string @samp{cr}, whereas @samp{ca*r} would match all | 456 So, for example, @samp{ca+r} matches the strings @samp{car} and |
424 three strings.@refill | 457 @samp{caaaar} but not the string @samp{cr}, whereas @samp{ca*r} matches |
458 all three strings. | |
425 | 459 |
426 @item ? | 460 @item ? |
427 is a suffix character similar to @samp{*} except that it can match the | 461 @cindex @samp{?} in regexp |
428 preceding expression either once or not at all. For example, | 462 is a quantifying suffix operator similar to @samp{*}, except that the |
429 @samp{ca?r} will match @samp{car} or @samp{cr}; nothing else. | 463 preceding expression can match either once or not at all. For example, |
464 @samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anything | |
465 else. | |
466 | |
467 @item *? | |
468 @cindex @samp{*?} in regexp | |
469 works just like @samp{*}, except that rather than matching the longest | |
470 match, it matches the shortest match. @samp{*?} is known as a | |
471 @dfn{non-greedy} quantifier, a regexp construct borrowed from Perl. | |
472 @c Did perl get this from somewhere? What's the real history of *? ? | |
473 | |
474 This construct is very useful for when you want to match the text inside | |
475 a pair of delimiters. For instance, @samp{/\*.*?\*/} will match C | |
476 comments in a string. This could not easily be achieved without the use | |
477 of a non-greedy quantifier. | |
478 | |
479 This construct has not been available prior to XEmacs 20.4. It is not | |
480 available in FSF Emacs. | |
481 | |
482 @item +? | |
483 @cindex @samp{+?} in regexp | |
484 is the non-greedy version of @samp{+}. | |
485 | |
486 @item ?? | |
487 @cindex @samp{??} in regexp | |
488 is the non-greedy version of @samp{?}. | |
489 | |
490 @item \@{n,m\@} | |
491 @c Note the spacing after the close brace is deliberate. | |
492 @cindex @samp{\@{n,m\@} }in regexp | |
493 serves as an interval quantifier, analogous to @samp{*} or @samp{+}, but | |
494 specifies that the expression must match at least @var{n} times, but no | |
495 more than @var{m} times. This syntax is supported by most Unix regexp | |
496 utilities, and has been introduced to XEmacs for the version 20.3. | |
497 | |
498 Unfortunately, the non-greedy version of this quantifier does not exist | |
499 currently, although it does in Perl. | |
430 | 500 |
431 @item [ @dots{} ] | 501 @item [ @dots{} ] |
502 @cindex character set (in regexp) | |
503 @cindex @samp{[} in regexp | |
504 @cindex @samp{]} in regexp | |
432 @samp{[} begins a @dfn{character set}, which is terminated by a | 505 @samp{[} begins a @dfn{character set}, which is terminated by a |
433 @samp{]}. In the simplest case, the characters between the two form | 506 @samp{]}. In the simplest case, the characters between the two brackets |
434 the set. Thus, @samp{[ad]} matches either one @samp{a} or one | 507 form the set. Thus, @samp{[ad]} matches either one @samp{a} or one |
435 @samp{d}, and @samp{[ad]*} matches any string composed of just | 508 @samp{d}, and @samp{[ad]*} matches any string composed of just @samp{a}s |
436 @samp{a}s and @samp{d}s (including the empty string), from which it | 509 and @samp{d}s (including the empty string), from which it follows that |
437 follows that @samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr}, | 510 @samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr}, |
438 @samp{caddaar}, etc.@refill | 511 @samp{caddaar}, etc.@refill |
439 | 512 |
440 You can include character ranges in a character set by writing two | 513 The usual regular expression special characters are not special inside a |
514 character set. A completely different set of special characters exists | |
515 inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill | |
516 | |
517 @samp{-} is used for ranges of characters. To write a range, write two | |
441 characters with a @samp{-} between them. Thus, @samp{[a-z]} matches any | 518 characters with a @samp{-} between them. Thus, @samp{[a-z]} matches any |
442 lower-case letter. Ranges may be intermixed freely with individual | 519 lower case letter. Ranges may be intermixed freely with individual |
443 characters, as in @samp{[a-z$%.]}, which matches any lower-case letter | 520 characters, as in @samp{[a-z$%.]}, which matches any lower case letter |
444 or @samp{$}, @samp{%}, or period. | 521 or @samp{$}, @samp{%}, or a period.@refill |
445 @refill | 522 |
446 | 523 To include a @samp{]} in a character set, make it the first character. |
447 Note that inside a character set the usual special characters are not | 524 For example, @samp{[]a]} matches @samp{]} or @samp{a}. To include a |
448 special any more. A completely different set of special characters | 525 @samp{-}, write @samp{-} as the first character in the set, or put it |
449 exists inside character sets: @samp{]}, @samp{-}, and @samp{^}.@refill | 526 immediately after a range. (You can replace one individual character |
450 | 527 @var{c} with the range @samp{@var{c}-@var{c}} to make a place to put the |
451 To include a @samp{]} in a character set, you must make it the first | 528 @samp{-}.) There is no way to write a set containing just @samp{-} and |
452 character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. To | 529 @samp{]}. |
453 include a @samp{-}, write @samp{---}, which is a range containing only | 530 |
454 @samp{-}. To include @samp{^}, make it other than the first character | 531 To include @samp{^} in a set, put it anywhere but at the beginning of |
455 in the set.@refill | 532 the set. |
456 | 533 |
457 @item [^ @dots{} ] | 534 @item [^ @dots{} ] |
535 @cindex @samp{^} in regexp | |
458 @samp{[^} begins a @dfn{complement character set}, which matches any | 536 @samp{[^} begins a @dfn{complement character set}, which matches any |
459 character except the ones specified. Thus, @samp{[^a-z0-9A-Z]} | 537 character except the ones specified. Thus, @samp{[^a-z0-9A-Z]} |
460 matches all characters @i{except} letters and digits.@refill | 538 matches all characters @emph{except} letters and digits.@refill |
461 | 539 |
462 @samp{^} is not special in a character set unless it is the first | 540 @samp{^} is not special in a character set unless it is the first |
463 character. The character following the @samp{^} is treated as if it | 541 character. The character following the @samp{^} is treated as if it |
464 were first (@samp{-} and @samp{]} are not special there). | 542 were first (thus, @samp{-} and @samp{]} are not special there). |
465 | 543 |
466 Note that a complement character set can match a newline, unless | 544 Note that a complement character set can match a newline, unless |
467 newline is mentioned as one of the characters not to match. | 545 newline is mentioned as one of the characters not to match. |
468 | 546 |
469 @item ^ | 547 @item ^ |
470 is a special character that matches the empty string, but only if at | 548 @cindex @samp{^} in regexp |
471 the beginning of a line in the text being matched. Otherwise, it fails | 549 @cindex beginning of line in regexp |
472 to match anything. Thus, @samp{^foo} matches a @samp{foo} that occurs | 550 is a special character that matches the empty string, but only at the |
473 at the beginning of a line. | 551 beginning of a line in the text being matched. Otherwise it fails to |
552 match anything. Thus, @samp{^foo} matches a @samp{foo} that occurs at | |
553 the beginning of a line. | |
554 | |
555 When matching a string instead of a buffer, @samp{^} matches at the | |
556 beginning of the string or after a newline character @samp{\n}. | |
474 | 557 |
475 @item $ | 558 @item $ |
559 @cindex @samp{$} in regexp | |
476 is similar to @samp{^} but matches only at the end of a line. Thus, | 560 is similar to @samp{^} but matches only at the end of a line. Thus, |
477 @samp{xx*$} matches a string of one @samp{x} or more at the end of a line. | 561 @samp{x+$} matches a string of one @samp{x} or more at the end of a line. |
562 | |
563 When matching a string instead of a buffer, @samp{$} matches at the end | |
564 of the string or before a newline character @samp{\n}. | |
478 | 565 |
479 @item \ | 566 @item \ |
480 does two things: it quotes the special characters (including | 567 @cindex @samp{\} in regexp |
568 has two functions: it quotes the special characters (including | |
481 @samp{\}), and it introduces additional special constructs. | 569 @samp{\}), and it introduces additional special constructs. |
482 | 570 |
483 Because @samp{\} quotes special characters, @samp{\$} is a regular | 571 Because @samp{\} quotes special characters, @samp{\$} is a regular |
484 expression that matches only @samp{$}, and @samp{\[} is a regular | 572 expression that matches only @samp{$}, and @samp{\[} is a regular |
485 expression that matches only @samp{[}, and so on.@refill | 573 expression that matches only @samp{[}, and so on. |
574 | |
575 @c Removed a paragraph here in lispref about doubling backslashes inside | |
576 @c of Lisp strings. | |
577 | |
486 @end table | 578 @end table |
487 | 579 |
488 Note: for historical compatibility, special characters are treated as | 580 @strong{Please note:} For historical compatibility, special characters |
489 ordinary ones if they are in contexts where their special meanings make no | 581 are treated as ordinary ones if they are in contexts where their special |
490 sense. For example, @samp{*foo} treats @samp{*} as ordinary since there is | 582 meanings make no sense. For example, @samp{*foo} treats @samp{*} as |
491 no preceding expression on which the @samp{*} can act. It is poor practice | 583 ordinary since there is no preceding expression on which the @samp{*} |
492 to depend on this behavior; better to quote the special character anyway, | 584 can act. It is poor practice to depend on this behavior; quote the |
493 regardless of where is appears.@refill | 585 special character anyway, regardless of where it appears.@refill |
494 | 586 |
495 Usually, @samp{\} followed by any character matches only | 587 For the most part, @samp{\} followed by any character matches only |
496 that character. However, there are several exceptions: characters | 588 that character. However, there are several exceptions: characters |
497 which, when preceded by @samp{\}, are special constructs. Such | 589 that, when preceded by @samp{\}, are special constructs. Such |
498 characters are always ordinary when encountered on their own. Here | 590 characters are always ordinary when encountered on their own. Here |
499 is a table of @samp{\} constructs. | 591 is a table of @samp{\} constructs: |
500 | 592 |
501 @table @kbd | 593 @table @kbd |
502 @item \| | 594 @item \| |
595 @cindex @samp{|} in regexp | |
596 @cindex regexp alternative | |
503 specifies an alternative. | 597 specifies an alternative. |
504 Two regular expressions @var{a} and @var{b} with @samp{\|} in | 598 Two regular expressions @var{a} and @var{b} with @samp{\|} in |
505 between form an expression that matches anything @var{a} or | 599 between form an expression that matches anything that either @var{a} or |
506 @var{b} matches.@refill | 600 @var{b} matches.@refill |
507 | 601 |
508 Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar} | 602 Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar} |
509 but no other string.@refill | 603 but no other string.@refill |
510 | 604 |
513 @samp{\|}.@refill | 607 @samp{\|}.@refill |
514 | 608 |
515 Full backtracking capability exists to handle multiple uses of @samp{\|}. | 609 Full backtracking capability exists to handle multiple uses of @samp{\|}. |
516 | 610 |
517 @item \( @dots{} \) | 611 @item \( @dots{} \) |
612 @cindex @samp{(} in regexp | |
613 @cindex @samp{)} in regexp | |
614 @cindex regexp grouping | |
518 is a grouping construct that serves three purposes: | 615 is a grouping construct that serves three purposes: |
519 | 616 |
520 @enumerate | 617 @enumerate |
521 @item | 618 @item |
522 To enclose a set of @samp{\|} alternatives for other operations. | 619 To enclose a set of @samp{\|} alternatives for other operations. |
523 Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}. | 620 Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}. |
524 | 621 |
525 @item | 622 @item |
526 To enclose a complicated expression for the postfix @samp{*} to operate on. | 623 To enclose an expression for a suffix operator such as @samp{*} to act |
527 Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any (zero or | 624 on. Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any |
528 more) number of @samp{na} strings.@refill | 625 (zero or more) number of @samp{na} strings.@refill |
529 | 626 |
530 @item | 627 @item |
531 To mark a matched substring for future reference. | 628 To record a matched substring for future reference. |
532 | |
533 @end enumerate | 629 @end enumerate |
534 | 630 |
535 This last application is not a consequence of the idea of a | 631 This last application is not a consequence of the idea of a |
536 parenthetical grouping; it is a separate feature which happens to be | 632 parenthetical grouping; it is a separate feature that happens to be |
537 assigned as a second meaning to the same @samp{\( @dots{} \)} construct | 633 assigned as a second meaning to the same @samp{\( @dots{} \)} construct |
538 because in practice there is no conflict between the two meanings. | 634 because there is no conflict in practice between the two meanings. |
539 Here is an explanation: | 635 Here is an explanation of this feature: |
540 | 636 |
541 @item \@var{digit} | 637 @item \@var{digit} |
542 after the end of a @samp{\( @dots{} \)} construct, the matcher remembers the | 638 matches the same text that matched the @var{digit}th occurrence of a |
543 beginning and end of the text matched by that construct. Then, later on | |
544 in the regular expression, you can use @samp{\} followed by @var{digit} | |
545 to mean ``match the same text matched the @var{digit}'th time by the | |
546 @samp{\( @dots{} \)} construct.''@refill | |
547 | |
548 The strings matching the first nine @samp{\( @dots{} \)} constructs appearing | |
549 in a regular expression are assigned numbers 1 through 9 in order that the | |
550 open-parentheses appear in the regular expression. @samp{\1} through | |
551 @samp{\9} may be used to refer to the text matched by the corresponding | |
552 @samp{\( @dots{} \)} construct. | 639 @samp{\( @dots{} \)} construct. |
640 | |
641 In other words, after the end of a @samp{\( @dots{} \)} construct, the | |
642 matcher remembers the beginning and end of the text matched by that | |
643 construct. Then, later on in the regular expression, you can use | |
644 @samp{\} followed by @var{digit} to match that same text, whatever it | |
645 may have been. | |
646 | |
647 The strings matching the first nine @samp{\( @dots{} \)} constructs | |
648 appearing in a regular expression are assigned numbers 1 through 9 in | |
649 the order that the open parentheses appear in the regular expression. | |
650 So you can use @samp{\1} through @samp{\9} to refer to the text matched | |
651 by the corresponding @samp{\( @dots{} \)} constructs. | |
553 | 652 |
554 For example, @samp{\(.*\)\1} matches any newline-free string that is | 653 For example, @samp{\(.*\)\1} matches any newline-free string that is |
555 composed of two identical halves. The @samp{\(.*\)} matches the first | 654 composed of two identical halves. The @samp{\(.*\)} matches the first |
556 half, which may be anything, but the @samp{\1} that follows must match | 655 half, which may be anything, but the @samp{\1} that follows must match |
557 the same exact text. | 656 the same exact text. |
558 | 657 |
658 @item \(?: @dots{} \) | |
659 @cindex @samp{\(?:} in regexp | |
660 @cindex regexp grouping | |
661 is called a @dfn{shy} grouping operator, and it is used just like | |
662 @samp{\( @dots{} \)}, except that it does not cause the matched | |
663 substring to be recorded for future reference. | |
664 | |
665 This is useful when you need a lot of grouping @samp{\( @dots{} \)} | |
666 constructs, but only want to remember one or two -- or if you have | |
667 more than nine groupings and need to use backreferences to refer to | |
668 the groupings at the end. | |
669 | |
670 Using @samp{\(?: @dots{} \)} rather than @samp{\( @dots{} \)} when you | |
671 don't need the captured substrings ought to speed up your programs some, | |
672 since it shortens the code path followed by the regular expression | |
673 engine, as well as the amount of memory allocation and string copying it | |
674 must do. The actual performance gain to be observed has not been | |
675 measured or quantified as of this writing. | |
676 @c This is used to good advantage by the font-locking code, and by | |
677 @c `regexp-opt.el'. | |
678 | |
679 The shy grouping operator has been borrowed from Perl, and has not been | |
680 available prior to XEmacs 20.3, nor is it available in FSF Emacs. | |
681 | |
682 @item \w | |
683 @cindex @samp{\w} in regexp | |
684 matches any word-constituent character. The editor syntax table | |
685 determines which characters these are. @xref{Syntax}. | |
686 | |
687 @item \W | |
688 @cindex @samp{\W} in regexp | |
689 matches any character that is not a word constituent. | |
690 | |
691 @item \s@var{code} | |
692 @cindex @samp{\s} in regexp | |
693 matches any character whose syntax is @var{code}. Here @var{code} is a | |
694 character that represents a syntax code: thus, @samp{w} for word | |
695 constituent, @samp{-} for whitespace, @samp{(} for open parenthesis, | |
696 etc. @xref{Syntax}, for a list of syntax codes and the characters that | |
697 stand for them. | |
698 | |
699 @item \S@var{code} | |
700 @cindex @samp{\S} in regexp | |
701 matches any character whose syntax is not @var{code}. | |
702 @end table | |
703 | |
704 The following regular expression constructs match the empty string---that is, | |
705 they don't use up any characters---but whether they match depends on the | |
706 context. | |
707 | |
708 @table @kbd | |
559 @item \` | 709 @item \` |
560 matches the empty string, provided it is at the beginning | 710 @cindex @samp{\`} in regexp |
561 of the buffer. | 711 matches the empty string, but only at the beginning |
712 of the buffer or string being matched against. | |
562 | 713 |
563 @item \' | 714 @item \' |
564 matches the empty string, provided it is at the end of | 715 @cindex @samp{\'} in regexp |
565 the buffer. | 716 matches the empty string, but only at the end of |
717 the buffer or string being matched against. | |
718 | |
719 @item \= | |
720 @cindex @samp{\=} in regexp | |
721 matches the empty string, but only at point. | |
722 (This construct is not defined when matching against a string.) | |
566 | 723 |
567 @item \b | 724 @item \b |
568 matches the empty string, provided it is at the beginning or | 725 @cindex @samp{\b} in regexp |
726 matches the empty string, but only at the beginning or | |
569 end of a word. Thus, @samp{\bfoo\b} matches any occurrence of | 727 end of a word. Thus, @samp{\bfoo\b} matches any occurrence of |
570 @samp{foo} as a separate word. @samp{\bballs?\b} matches | 728 @samp{foo} as a separate word. @samp{\bballs?\b} matches |
571 @samp{ball} or @samp{balls} as a separate word.@refill | 729 @samp{ball} or @samp{balls} as a separate word.@refill |
572 | 730 |
573 @item \B | 731 @item \B |
574 matches the empty string, provided it is @i{not} at the beginning or | 732 @cindex @samp{\B} in regexp |
733 matches the empty string, but @emph{not} at the beginning or | |
575 end of a word. | 734 end of a word. |
576 | 735 |
577 @item \< | 736 @item \< |
578 matches the empty string, provided it is at the beginning of a word. | 737 @cindex @samp{\<} in regexp |
738 matches the empty string, but only at the beginning of a word. | |
579 | 739 |
580 @item \> | 740 @item \> |
581 matches the empty string, provided it is at the end of a word. | 741 @cindex @samp{\>} in regexp |
582 | 742 matches the empty string, but only at the end of a word. |
583 @item \w | |
584 matches any word-constituent character. The editor syntax table | |
585 determines which characters these are. | |
586 | |
587 @item \W | |
588 matches any character that is not a word-constituent. | |
589 | |
590 @item \s@var{code} | |
591 matches any character whose syntax is @var{code}. @var{code} is a | |
592 character which represents a syntax code: thus, @samp{w} for word | |
593 constituent, @samp{-} for whitespace, @samp{(} for open-parenthesis, | |
594 etc. @xref{Syntax}.@refill | |
595 | |
596 @item \S@var{code} | |
597 matches any character whose syntax is not @var{code}. | |
598 @end table | 743 @end table |
599 | 744 |
600 Here is a complicated regexp used by Emacs to recognize the end of a | 745 Here is a complicated regexp used by Emacs to recognize the end of a |
601 sentence together with any whitespace that follows. It is given in Lisp | 746 sentence together with any whitespace that follows. It is given in Lisp |
602 syntax to enable you to distinguish the spaces from the tab characters. In | 747 syntax to enable you to distinguish the spaces from the tab characters. In |