PCRE(3)                                                   PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

       Certain items that may appear in regular expression patterns
       are more efficient than others. It is more efficient to use
       a character class like [aeiou] than a set of alternatives
       such as (a|e|i|o|u). In general, the simplest construction
       that provides the required behaviour is the most efficient.
       Jeffrey Friedl's book "Mastering Regular Expressions"
       contains a lot of discussion about optimizing regular
       expressions for efficient performance.
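
       The equivalence of the two forms can be checked with a short
       sketch. It uses Python's re module as a stand-in, which is an
       assumption of this example: re is a different backtracking
       engine, not PCRE, so only the matching behaviour carries
       over, not the timings. The variable names are illustrative.

```python
import re

# Two spellings of "any one vowel": a character class and a set of
# alternatives. The class is the simpler construction.
vowel_class = re.compile(r"[aeiou]")
vowel_alternation = re.compile(r"(a|e|i|o|u)")

text = "regular expressions"

# Collect the offset of every match found by each pattern.
class_hits = [m.start() for m in vowel_class.finditer(text)]
alternation_hits = [m.start() for m in vowel_alternation.finditer(text)]

# Both patterns find the same vowels at the same offsets.
same_result = class_hits == alternation_hits
```

       Since the results are identical, the choice between the two
       forms can be made on efficiency and readability alone.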

       When a pattern begins with .* not in  parentheses,  or  in
       parentheses  that  are not the subject of a backreference,
       and the PCRE_DOTALL option is set, the pattern is  implic-
       itly  anchored  by  PCRE,  since  it can match only at the
       start of a subject string. However, if PCRE_DOTALL is  not
       set,  PCRE  cannot  make  this optimization, because the .
       metacharacter does not then match a newline,  and  if  the
       subject  string  contains  newlines, the pattern may match
       from the  character  immediately  following  one  of  them
       instead of from the very start. For example, the pattern

         .*second

       matches  the  subject "first\nand second" (where \n stands
       for a newline character), with the match starting  at  the
       seventh  character. In order to do this, PCRE has to retry
       the match starting after every newline in the subject.
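
       The effect of the option on the match position can be seen
       in a minimal sketch. It assumes Python's re module, whose
       re.DOTALL flag plays the role of PCRE_DOTALL here; re is not
       PCRE and does not necessarily apply the same anchoring
       optimization, so the sketch shows only where the match
       starts.

```python
import re

subject = "first\nand second"

# Without DOTALL, "." does not match the newline, so a match ending
# in "second" can only begin after the newline.
plain = re.search(r".*second", subject)

# With DOTALL, ".*" can run across the newline, so the match begins
# at the very start of the subject.
dotall = re.search(r".*second", subject, re.DOTALL)

plain_start = plain.start()    # 0-based offset of the seventh character
dotall_start = dotall.start()
```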

       If you are using such a pattern with subject strings  that
       do  not contain newlines, the best performance is obtained
       by setting PCRE_DOTALL, or starting the pattern  with  ^.*
       to  indicate explicit anchoring. That saves PCRE from hav-
       ing to scan along the subject looking  for  a  newline  to
       restart at.
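
       For a subject without newlines, all three spellings find the
       same match, so choosing an anchored one costs nothing in
       results. The sketch below again assumes Python's re module
       as a stand-in for PCRE; the subject string is made up for
       illustration.

```python
import re

subject = "the first and second items"   # contains no newline

# Implicitly anchorable form (DOTALL set), explicitly anchored form,
# and the unanchored form without DOTALL.
implicit = re.search(r".*second", subject, re.DOTALL)
explicit = re.search(r"^.*second", subject)
unanchored = re.search(r".*second", subject)

# Gather the (start, end) span of each match; a single-element set
# means all three agree.
spans = {implicit.span(), explicit.span(), unanchored.span()}
```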

       Beware of patterns that contain nested indefinite repeats.
       These can take a long time to run when applied to a string
       that does not match. Consider the pattern fragment

         (a+)*

       This can match "aaaa" in 16 different ways, and this number
       increases very rapidly as the string gets longer. (The
       * repeat can match 0, 1, 2, 3, or 4 times, and for each of
       those cases other than 0, the + repeats can match  differ-
       ent  numbers  of times.) When the remainder of the pattern
       is such that the entire match is going to fail,  PCRE  has
       in principle to try every possible variation, and this can
       take an extremely long time.
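
       The growth can be made concrete by counting the variations
       directly. The sketch below is not a description of PCRE's
       internals: it assumes that one "way" is a sequence of
       non-empty runs of "a", one run per iteration of the outer *,
       consuming some prefix of the string, with zero iterations
       counted once. The function name count_matches is
       hypothetical. Under that convention the total doubles with
       each additional character.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_matches(n):
    """Count the ways (a+)* can match at the start of n 'a' characters."""
    # One way: stop here (the outer * makes no further iterations).
    total = 1
    # Otherwise the inner a+ takes a run of 1..n characters, and the
    # outer * continues on whatever remains.
    for run in range(1, n + 1):
        total += count_matches(n - run)
    return total


# Number of ways for strings of 1 to 5 'a' characters.
counts = [count_matches(n) for n in range(1, 6)]
```

       A failing match has, in principle, to exhaust all of these
       before giving up, which is why the running time grows so
       quickly with the length of the subject.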

       An optimization catches some of the simpler cases such as

         (a+)*b

       where a literal character follows. Before embarking on the
       standard matching procedure, PCRE checks that there  is  a
       "b"  later  in the subject string, and if there is not, it
       fails the match immediately. However,  when  there  is  no
       following  literal  this  optimization cannot be used. You
       can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. The pattern above, (a+)*b, gives a
       failure almost instantly when applied to a whole line of "a"
       characters, whereas (a+)*\d takes an appreciable time with
       strings longer than about 20 characters.
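
       The idea behind the literal check can be sketched in a few
       lines. This is an illustration only: PCRE performs the check
       internally, and the helper search_with_required_literal and
       its parameters are hypothetical names, written against
       Python's re module as a stand-in engine.

```python
import re


def search_with_required_literal(pattern, required, subject):
    """Run pattern on subject only if the required literal occurs in it.

    `required` is a literal that every match of `pattern` must contain,
    such as "b" for (a+)*b. The caller supplies it by hand here.
    """
    # Cheap linear scan: without the literal, no match is possible,
    # so the expensive backtracking search is skipped entirely.
    if required not in subject:
        return None
    return re.search(pattern, subject)


# A whole line of "a" characters: rejected by the scan alone.
no_match = search_with_required_literal(r"(a+)*b", "b", "a" * 40)

# A subject containing the literal is handed to the regex engine.
match = search_with_required_literal(r"(a+)*b", "b", "aaab")
```

       No such single character can be named for (a+)*\d, because
       \d stands for any digit, so a check of this simple kind has
       nothing to look for.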

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.



                                                          PCRE(3)
