diff src/regex.c @ 502:7039e6323819

[xemacs-hg @ 2001-05-04 22:41:46 by ben]

----------------------- byte-comp warning fixes -----------------

New functions for cleanly eliminating byte-compiler warnings. Their definitions require no changes at all in bytecomp.el, meaning that any package that wants to use them and be compatible with older versions of XEmacs need only copy the code and rename the functions (i.e. prefix them with the package name).

Eliminate byte-compiler warnings using the new functions in bytecomp-runtime.el.

Move coding-system-{put,get,category}, since they're not Mule-specific and are used in prefer-coding-system.

font.el was incredibly ugly. Clean it up.

Avoid using defsubst for any exported functions, to avoid possible compatibility problems if we later change the internal interface. (It happened before, with face accessors, between 19.8 and 19.9.) Fix tons of warnings.

Clean up (new function gpm-is-supported-p eliminates duplicate code in gpm-create/delete-device-hook) and eliminate warnings.

---------- byte-recompile-directory in the core `lisp' dir ---------

Make byte-recompile-directory work in the core `lisp' dir, even in the absence of a Mule XEmacs (i.e. make it skip the Mule files rather than trying to compile them). Now you should be able to do `touch *.el' in the `lisp' dir, then M-x byte-recompile-directory, and get no warnings.

Avoid trying to compile Mule files in byte-recompile-directory when we're not in a Mule XEmacs, since we're highly likely to get syntax errors.

Add a coding-system cookie to all Mule files so that byte-recompile-directory ignores them.

Magic cookie function moved to files.el from code-files.el (for use by bytecomp even in a non-coding-system XEmacs), and changed names and semantics for use by bytecomp. NOTE: IMO this is an internal function that we can change as we like (and there is absolutely no code anywhere else using the function).

---------------- GUI improvements: menus, help -------------------

Rearrange order of keymap declarations to be alphabetical. Improve help on help to include all bindings, and group by category. Add bindings for new Info commands. Remove warnings. Use command-hyper-apropos in place of command-apropos.

Add a function to do the equivalent of command-apropos. Evals its help-text argument so you can put expressions there. Used now by help-for-help.

Add binding to continue text searches. Expand index searches to work over multiple info documents. Add commands to search text/index in User and Lispref.

Add new entry, "Uncomment Region" (parallels "Comment Out Region"). Redo Help menu; add bindings for new Info commands to search the index or text of the User and Lispref manuals. Add command for mark-paragraph, activate-region. Make Edit->R accelerator be rectangle, not register (more commonly used), and put rectangle first. Fix the Edit Init File entry to never load the .elc file. Simplify the default-popup-menu. Add Cmds->Tabs menu.

Use kp-left not kp_left, etc.

---------------- Miscellaneous bug fixes/cleanup -------------------

byte-compiler-options: Correct doc string.

easy-menu-do-define: fix extra quote.

fill-paragraph-or-region: Rewrite to be more correct -- use call-interactively so that we always get exactly the same behavior as if the functions were called directly.

No need to fiddle with zmacs-region-stays, now that bogus clearing of it (2001-04-28 src/ChangeLog) is removed.

Put dialog titles back in -- this time correctly. Fix various other problems with leaks and such.

key-sequence-list-description: Clean up function to always correctly canonicalize.

Clean up Kinsoku comments, synch comment-region with FSF 20.7.

* simple.el (region-exists-p):
* simple.el (region-active-p): Add comment about which one is correct to use in menu specs.
* sound.el (load-sound-file): Minor code clean up.
* startup.el:
* startup.el (command-line-early):
* startup.el (initial-scratch-message): Comment changes. Add info about sample.init.el to splash screen. Improve initial-scratch-message and clarify purpose of Scratch buffer. Fix byte-compile warning.

------------------------ Added features -------------------------

Add new variable to control whether etags checks all parent directories for tag files. (On by default.)

* hash-table.el: New file, useful utility functions.
* dumped-lisp.el (preloaded-file-list): Dump hash-table.el.

------------ notable bug fix: Windows event code --------------

Get critical quit working.

------------ notable bug fix and new feature: regex code --------------

Shy groups were implemented in a horrible, half-assed way that would cause them to screw up regex searching in most cases. Fixed to work correctly.

Also extended back-reference syntax past 9 (currently up to 99). A multi-digit reference is recognized as such only if there are at least that many non-shy groups; optionally, a warning is issued for such uses, to catch old code that might be using them differently. (Added variable to control this in search.c -- `warn-about-possibly-incompatible-back-references', on by default for the moment. Declared in lisp.h.)

---------------- process/SIGIO improvements -------------------

Define USE_GETADDRINFO to replace more complex conditional, and use it. The code conditionalized on this in unix_open_network_stream had *serious* problems handling errors. It's now fixed, and major amounts of duplicate code between the two versions were combined.

Don't disable SIGIO and other interrupts unless CONNECT_NEEDS_SLOWED_INTERRUPTS is defined -- don't penalize OS's without bugs. Similarly for a FreeBSD bug that was affecting all OS's.

* s/ultrix.h: define CONNECT_NEEDS_SLOWED_INTERRUPTS, since that's the OS mentioned as having a kernel bug.
* sysdep.c (request_sigio_on_device):
* sysdep.c (unrequest_sigio_on_device): fix SIGIO problems on Linux. add check for O_ASYNC in case it's defined and FASYNC isn't. add comment about other ways to do SIGIO on Linux.
* callproc.c (Fold_call_process_internal):
* process.c (Fstart_process_internal): Deal with the possibility that `default-directory' doesn't have terminating slash. Correct comments about vfork.

---------------- Miscellaneous bug fixes/cleanup -------------------

* callint.c (Finteractive): Add lots of documentation -- exactly what the Lisp equivalents of all the interactive specs are.
* console.h (struct console): change type of quit_char to Emchar.
* event-msw.c (lstream_type_create_mswindows_selectable): spacing change.

Eliminate events-mod.h and combine into events.h.

* emacs.c:
* emacs.c (make_arg_list_1):
* emacs.c (main_1): A couple of char->Extbyte changes, add a comment.
* glyphs-msw.c: Correct indentation of function defns to not exceed 80 cols. Try (sort of) to fix some code that sets the colors of the progress gauge. (Commented out)
* keymap.c (syms_of_keymap): use DEFSYMBOL.
* process.c (read_process_output): No need to fiddle with zmacs_region_stays, now that bogus clearing of it (see below) is removed.
* search.c (Freplace_match): warning fix.
author ben
date Fri, 04 May 2001 22:42:35 +0000
parents 223736d75acb
children cd662ad69f40
--- a/src/regex.c	Thu May 03 21:08:39 2001 +0000
+++ b/src/regex.c	Fri May 04 22:42:35 2001 +0000
@@ -415,7 +415,7 @@
 
         /* Start remembering the text that is matched, for storing in a
            register.  Followed by one byte with the register number, in
-           the range 0 to one less than the pattern buffer's re_nsub
+           the range 1 to the pattern buffer's re_ngroups
            field.  Then followed by one byte with the number of groups
            inner to this one.  (This last has to be part of the
            start_memory only because we need it in the on_failure_jump
@@ -424,7 +424,7 @@
 
         /* Stop remembering the text that is matched and store it in a
            memory register.  Followed by one byte with the register
-           number, in the range 0 to one less than `re_nsub' in the
+           number, in the range 1 to `re_ngroups' in the
            pattern buffer, and one byte with the number of inner groups,
            just like `start_memory'.  (We need the number of inner
            groups here because we don't have any easy way of finding the
@@ -971,6 +971,7 @@
     }
 
   printf ("re_nsub: %ld\t", (long)bufp->re_nsub);
+  printf ("re_ngroups: %ld\t", (long)bufp->re_ngroups);
   printf ("regs_alloc: %d\t", bufp->regs_allocated);
   printf ("can_be_null: %d\t", bufp->can_be_null);
   printf ("newline_anchor: %d\n", bufp->newline_anchor);
@@ -980,6 +981,20 @@
   printf ("syntax: %d\n", bufp->syntax);
   /* Perhaps we should print the translate table?  */
   /* and maybe the category table? */
+
+  if (bufp->external_to_internal_register)
+    {
+      int i;
+
+      printf ("external_to_internal_register:\n");
+      for (i = 0; i <= bufp->re_nsub; i++)
+	{
+	  if (i > 0)
+	    printf (", ");
+	  printf ("%d -> %d", i, bufp->external_to_internal_register[i]);
+	}
+      printf ("\n");
+    }
 }
 
 
@@ -1694,9 +1709,13 @@
 #define MAX_REGNUM 255
 
 /* But patterns can have more than `MAX_REGNUM' registers.  We just
-   ignore the excess.  */
+   ignore the excess.
+   #### not true!  groups past this will fail in lots of ways, if we
+   ever have to backtrack.
+  */
 typedef unsigned regnum_t;
 
+#define INIT_REG_TRANSLATE_SIZE 5
 
 /* Macros for the compile stack.  */
 
@@ -1880,7 +1899,9 @@
      `syntax' is set to SYNTAX;
      `used' is set to the length of the compiled pattern;
      `fastmap_accurate' is zero;
-     `re_nsub' is the number of subexpressions in PATTERN;
+     `re_ngroups' is the number of groups/subexpressions (including shy
+        groups) in PATTERN;
+     `re_nsub' is the number of non-shy groups in PATTERN;
      `not_bol' and `not_eol' are zero;
 
    The `fastmap' and `newline_anchor' fields are neither
@@ -1978,6 +1999,25 @@
 
   /* Always count groups, whether or not bufp->no_sub is set.  */
   bufp->re_nsub = 0;
+  bufp->re_ngroups = 0;
+
+  bufp->warned_about_incompatible_back_references = 0;
+
+  if (bufp->external_to_internal_register == 0)
+    {
+      bufp->external_to_internal_register_size = INIT_REG_TRANSLATE_SIZE;
+      RETALLOC (bufp->external_to_internal_register,
+		bufp->external_to_internal_register_size,
+		int);
+    }
+
+  {
+    int i;
+
+    bufp->external_to_internal_register[0] = 0;
+    for (i = 1; i < bufp->external_to_internal_register_size; i++)
+      bufp->external_to_internal_register[i] = (int) 0xDEADBEEF;
+  }
 
 #if !defined (emacs) && !defined (SYNTAX_TABLE)
   /* Initialize the syntax table.  */
@@ -2560,6 +2600,7 @@
             handle_open:
               {
                 regnum_t r;
+		int shy = 0;
 
                 if (!(syntax & RE_NO_SHY_GROUPS)
                     && p != pend
@@ -2570,7 +2611,7 @@
                     switch (c)
                       {
                       case ':': /* shy groups */
-                        r = MAX_REGNUM + 1;
+                        shy = 1;
                         break;
 
                       /* All others are reserved for future constructs. */
@@ -2578,11 +2619,32 @@
                         FREE_STACK_RETURN (REG_BADPAT);
                       }
                   }
-                else
-                  {
-                    bufp->re_nsub++;
-                    r = ++regnum;
-                  }
+
+		r = ++regnum;
+		bufp->re_ngroups++;
+		if (!shy)
+		  {
+		    bufp->re_nsub++;
+		    while (bufp->external_to_internal_register_size <=
+			   bufp->re_nsub)
+		      {
+			int i;
+			int old_size =
+			  bufp->external_to_internal_register_size;
+			bufp->external_to_internal_register_size += 5;
+			RETALLOC (bufp->external_to_internal_register,
+				  bufp->external_to_internal_register_size,
+				  int);
+			/* debugging */
+			for (i = old_size;
+			     i < bufp->external_to_internal_register_size; i++)
+			  bufp->external_to_internal_register[i] =
+			    (int) 0xDEADBEEF;
+		      }
+
+		    bufp->external_to_internal_register[bufp->re_nsub] =
+		      bufp->re_ngroups;
+		  }
 
                 if (COMPILE_STACK_FULL)
                   {
@@ -2606,7 +2668,10 @@
                 /* We will eventually replace the 0 with the number of
                    groups inner to this one.  But do not push a
                    start_memory for groups beyond the last one we can
-                   represent in the compiled pattern.  */
+                   represent in the compiled pattern.
+		   #### bad bad bad.  this will fail in lots of ways, if we
+		   ever have to backtrack for these groups.
+		*/
                 if (r <= MAX_REGNUM)
                   {
                     COMPILE_STACK_TOP.inner_group_offset
@@ -2996,21 +3061,59 @@
             case '1': case '2': case '3': case '4': case '5':
             case '6': case '7': case '8': case '9':
 	      {
-		regnum_t reg;
+		regnum_t reg, regint;
+		int may_need_to_unfetch = 0;
 		if (syntax & RE_NO_BK_REFS)
 		  goto normal_char;
 
+		/* This only goes up to 99.  It could be extended to work
+		   up to 255 (the maximum number of registers that can be
+		   handled by the current regexp engine, because it stores
+		   its register numbers in the compiled pattern as one byte,
+		   ugh).  Doing that's a bit trickier, because you might
+		   have the case where \25 is a back-ref but \255 is not, ... */
 		reg = c - '0';
-
-		if (reg > regnum)
+		if (p < pend)
+		  {
+		    PATFETCH (c);
+		    if (c >= '0' && c <= '9')
+		      {
+			regnum_t new_reg = reg * 10 + c - '0';
+			if (new_reg <= bufp->re_nsub)
+			  {
+			    reg = new_reg;
+			    may_need_to_unfetch = 1;
+			  }
+			else
+			  PATUNFETCH;
+		      }
+		  }
+		  
+		if (reg > bufp->re_nsub)
 		  FREE_STACK_RETURN (REG_ESUBREG);
 
+		regint = bufp->external_to_internal_register[reg];
 		/* Can't back reference to a subexpression if inside of it.  */
-		if (group_in_compile_stack (compile_stack, reg))
-		  goto normal_char;
+		if (group_in_compile_stack (compile_stack, regint))
+		  {
+		    if (may_need_to_unfetch)
+		      PATUNFETCH;
+		    goto normal_char;
+		  }
+
+#ifdef emacs
+		if (reg > 9 &&
+		    bufp->warned_about_incompatible_back_references == 0)
+		  {
+		    bufp->warned_about_incompatible_back_references = 1;
+		    warn_when_safe (intern ("regex"), Qinfo,
+				    "Back reference \\%d now has new "
+				    "semantics in %s", reg, pattern);
+		  }
+#endif
 
 		laststart = buf_end;
-		BUF_PUSH_2 (duplicate, reg);
+		BUF_PUSH_2 (duplicate, regint);
 	      }
               break;
 
@@ -3125,7 +3228,7 @@
      isn't necessary unless we're trying to avoid calling alloca in
      the search and match routines.  */
   {
-    int num_regs = bufp->re_nsub + 1;
+    int num_regs = bufp->re_ngroups + 1;
 
     /* Since DOUBLE_FAIL_STACK refuses to double only if the current size
        is strictly greater than re_max_failures, the largest possible stack
@@ -4386,7 +4489,7 @@
   /* We fill all the registers internally, independent of what we
      return, for use in backreferences.  The number here includes
      an element for register zero.  */
-  unsigned num_regs = bufp->re_nsub + 1;
+  unsigned num_regs = bufp->re_ngroups + 1;
 
   /* The currently active registers.  */
   unsigned lowest_active_reg = NO_LOWEST_ACTIVE_REG;
@@ -4472,7 +4575,7 @@
      there are groups, we include space for register 0 (the whole
      pattern), even though we never use it, since it simplifies the
      array indexing.  We should fix this.  */
-  if (bufp->re_nsub)
+  if (bufp->re_ngroups)
     {
       regstart       = REGEX_TALLOC (num_regs, re_char *);
       regend         = REGEX_TALLOC (num_regs, re_char *);
@@ -4650,12 +4753,13 @@
           /* If caller wants register contents data back, do it.  */
           if (regs && !bufp->no_sub)
 	    {
+	      int num_nonshy_regs = bufp->re_nsub + 1;
               /* Have the register data arrays been allocated?  */
               if (bufp->regs_allocated == REGS_UNALLOCATED)
                 { /* No.  So allocate them with malloc.  We need one
                      extra element beyond `num_regs' for the `-1' marker
                      GNU code uses.  */
-                  regs->num_regs = MAX (RE_NREGS, num_regs + 1);
+                  regs->num_regs = MAX (RE_NREGS, num_nonshy_regs + 1);
                   regs->start = TALLOC (regs->num_regs, regoff_t);
                   regs->end = TALLOC (regs->num_regs, regoff_t);
                   if (regs->start == NULL || regs->end == NULL)
@@ -4669,9 +4773,9 @@
                 { /* Yes.  If we need more elements than were already
                      allocated, reallocate them.  If we need fewer, just
                      leave it alone.  */
-                  if (regs->num_regs < num_regs + 1)
+                  if (regs->num_regs < num_nonshy_regs + 1)
                     {
-                      regs->num_regs = num_regs + 1;
+                      regs->num_regs = num_nonshy_regs + 1;
                       RETALLOC (regs->start, regs->num_regs, regoff_t);
                       RETALLOC (regs->end, regs->num_regs, regoff_t);
                       if (regs->start == NULL || regs->end == NULL)
@@ -4701,16 +4805,19 @@
 
               /* Go through the first `min (num_regs, regs->num_regs)'
                  registers, since that is all we initialized.  */
-	      for (mcnt = 1; mcnt < MIN (num_regs, regs->num_regs); mcnt++)
+	      for (mcnt = 1; mcnt < MIN (num_nonshy_regs, regs->num_regs);
+		   mcnt++)
 		{
-                  if (REG_UNSET (regstart[mcnt]) || REG_UNSET (regend[mcnt]))
+		  int internal_reg = bufp->external_to_internal_register[mcnt];
+                  if (REG_UNSET (regstart[internal_reg]) ||
+		      REG_UNSET (regend[internal_reg]))
                     regs->start[mcnt] = regs->end[mcnt] = -1;
                   else
                     {
-		      regs->start[mcnt]
-			= (regoff_t) POINTER_TO_OFFSET (regstart[mcnt]);
-                      regs->end[mcnt]
-			= (regoff_t) POINTER_TO_OFFSET (regend[mcnt]);
+		      regs->start[mcnt] =
+			(regoff_t) POINTER_TO_OFFSET (regstart[internal_reg]);
+                      regs->end[mcnt] =
+			(regoff_t) POINTER_TO_OFFSET (regend[internal_reg]);
                     }
 		}
 
@@ -4719,7 +4826,7 @@
                  we (re)allocated the registers, this is the case,
                  because we always allocate enough to have at least one
                  -1 at the end.  */
-              for (mcnt = num_regs; mcnt < regs->num_regs; mcnt++)
+              for (mcnt = num_nonshy_regs; mcnt < regs->num_regs; mcnt++)
                 regs->start[mcnt] = regs->end[mcnt] = -1;
 	    } /* regs && !bufp->no_sub */
 
@@ -5065,11 +5172,15 @@
 
 
 	/* \<digit> has been turned into a `duplicate' command which is
-           followed by the numeric value of <digit> as the register number.  */
+           followed by the numeric value of <digit> as the register number.
+	   (Already passed through external-to-internal-register mapping,
+	   so it refers to the actual group number, not the non-shy-only
+	   numbering used in the external world.) */
         case duplicate:
 	  {
 	    REGISTER re_char *d2, *dend2;
-	    int regno = *p++;   /* Get which register to match against.  */
+	    /* Get which register to match against.  */
+	    int regno = *p++;
 	    DEBUG_PRINT2 ("EXECUTING duplicate %d.\n", regno);
 
 	    /* Can't back reference a group which we've never matched.  */
@@ -6222,6 +6333,8 @@
      `newline_anchor' to REG_NEWLINE being set in CFLAGS;
      `fastmap' and `fastmap_accurate' to zero;
      `re_nsub' to the number of subexpressions in PATTERN.
+     (non-shy of course.  POSIX probably doesn't know about
+     shy ones, and in any case they should be invisible.)
 
    PATTERN is the address of the pattern string.