Index of Section 1 Manual Pages

Interix / SUAlexdoc.1Interix / SUA

lexdoc(1)                                                     lexdoc(1)

  lexdoc

  NAME

    lexdoc - documentation for lex, fast lexical analyzer generator

  SYNOPSIS

    lex [-78BbcdFfhIiLlnpsTtVvw] [-C[aefFmr]] [-Pprefix]
        [-Sskeleton] [filename ...]

  DESCRIPTION

    The lex(1) utility is a tool for generating scanners, which are programs
    that recognized lexical patterns in text. The lex(1) utility reads the
    given input files, or its standard input if no file names are given, for a
    description of a scanner to generate. The description is in the form of
    pairs of regular expressions and C code, called rules. The lex(1) utility
    generates as output a C source file, lex.yy.c, which defines the routine
    yylex(). This file is compiled and linked with the -ll library to produce
    an executable. When the executable is run, it analyzes its input for
    occurrences of the regular expressions. Whenever it finds one, it executes
    the corresponding C code.

  SIMPLE EXAMPLES

    These simple examples illustrate how to use lex(1).

    The following lex(1) input specifies a scanner that, whenever it
    encounters the string "username", will replace it with the user's login
    name:

    %%
    username    printf( "%s", getlogin() );

    By default, any text not matched by a lex(1) scanner is copied to the
    output. The effect of this scanner is to copy its input file to its output
    with each occurrence of "username" expanded. In this input, there is one
    rule. The element "username" is the pattern, and "printf" is the action.
    The "%%" marks the beginning of the rules.

    Another simple example follows:

        int num_lines = 0, num_chars = 0;
    %%
    \n      ++num_lines; ++num_chars;
    %%
    main()
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            }

    This scanner counts the number of characters and the number of lines in
    its input. It produces no output other than the final report on the
    counts. The first line declares two globals, "num_lines" and "num_chars",
    which are accessible both inside yylex() and in the main() routine
    declared after the second "%%". There are two rules, one that matches a
    newline ("\n") and increments both the line count and the character count,
    and one that matches any character other than a newline (indicated by the
    "." regular expression).

    The next example is slightly more complicated:

    /* scanner for a toy Pascal-like language */
    %{
    /* need this for the call to atof() below */
    #include 
    %}
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%
    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }
    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }
    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }
    {ID}        printf( "An identifier: %s\n", yytext );
    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
    "{"[^}\n]*"}"     /* consume one-line comments */
    [ \t\n]+          /* consume white space */

    %%
    main( argc, argv )
    int argc;
    char **argv;
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;
        yylex();
        }

    This is the beginning of a simple scanner for a language like Pascal. It
    identifies different types of tokens and reports what it has seen.

    The details of this example will be explained in the following sections.

  FORMAT OF THE INPUT FILE

    The lex(1) input file consists of three sections, separated by a line with
    just %% in it:

    definitions %% rules %% user code

    The definitions section contains declarations of simple name definitions
    to simplify the scanner specification, and declarations of start
    conditions, which are explained in a later section.

    Name definitions have the form:

    name definition

    The name is a word beginning with a letter or an underscore ('_'),
    followed by zero or more letters, digits, underscores ('_'), or dashes ('-
    '). The definition is taken to begin at the first non-white-space
    character following the name and continuing to the end of the line. The
    definition can subsequently be referred to using "{name}", which will
    expand to "(definition)". For example,

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    defines "DIGIT" to be a regular expression that matches a single digit,
    and "ID" to be a regular expression that matches a letter followed by zero
    or more letters or digits. A subsequent reference to:

    {DIGIT}+"."{DIGIT}*

    is identical to:

    ([0-9])+"."([0-9])*

    and matches one or more digits followed by a '.' that is followed by zero
    or more digits.

    The rules section of the lex(1) input contains a series of rules of the
    form:

    pattern   action

    where the pattern must be unindented and the action must begin on the same
    line.

    Patterns and actions are described in more detail later in this topic.

    Finally, the user code section is simply copied to lex.yy.c verbatim. It
    is used for companion routines that call or are called by the scanner. The
    presence of this section is optional; if it is missing, the second %% in
    the input file can also be skipped.

    In the definitions and rules sections, any indented text or text enclosed
    by %{ and %} is copied verbatim to the output (with instances of %{}
    removed). The instances of %{} must appear unindented on lines by
    themselves.

    In the rules section, any indented or %{} text appearing before the first
    rule can be used to declare variables that are local to the scanning
    routine and (after the declarations) code that will be executed whenever
    the scanning routine is entered. Other indented or %{} text in the rule
    section is still copied to the output, but its meaning is not well
    defined, and it might cause compile-time errors (this feature is present
    for POSIX compliance).

    In the definitions section (but not in the rules section), an unindented
    comment (that is, a line beginning with /*) is also copied verbatim to the
    output up to the next */.

  PATTERNS

    Patterns in the input are written using an extended set of regular
    expressions. These are described in the following table:

    Pattern            Matches

    x                  Match the character 'x'.

    .                  Any character except newline.

    [xyz]              A "character class"; in this case, the pattern matches
                       either an 'x', 'y', or 'z'.

    [abj-oZ]           A "character class" that contains a range; matches an
                       'a', 'b', any letter from 'j' through 'o', or a 'Z'.

    [^A-Z]             A "negated character class" (that is, any character
                       except those in the class. In this case, any character
                       except an uppercase letter).

    [^A-Z\n]           Any character except an uppercase letter or a newline.

    r*                 Zero or more instances of r, where r is any regular
                       expression.

    r+                 One or more instances of r.

    r?                 Zero or one instance ofr (that is, an optional r).

    r{2,5}             Two to five instances of r.

    r{2,}              Two or more instances of r.

    r{4}               Exactly four instances of r.

    {name}             The expansion of the name definition.

    "[xyz]\"star"      The literal string: [xyz]"star.

    \X                 If X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', the
                       ANSI C interpretation of \x. Otherwise, a literal 'X'
                       (used to escape operators such as '*').

    \123               The character with octal value 123.

    \x2a               The character with hexadecimal value 2a

    (r)                Match an r; parentheses are used to override precedence
                       (discussed later in this topic).

    rs                 "Concatenation:" the regular expression r followed by
                       the regular expression s.

    r|s                Either an r or an s.

    r/s                An r, but only if it is followed by an s. The s is not
                       part of the matched text. This type of pattern is
                       called "trailing context".

    ^r                 An r, but only at the beginning of a line.

    r$                 An r, but only at the end of a line. Equivalent to "r/
                       \n".

    r               An r, but only in start condition s. (See the
                       discussion of start conditions later in this topic.)

    r        Same, but in any of start conditions s1, s2, or s3.

    <*>r               An r in any start condition, even an exclusive one.

    <>            An end-of-file.

    <>     An end-of-file when in start condition s1 or s2.

    Note that inside a character class, all regular-expression operators lose
    their special meaning except escape ('\') and the character class
    operators, '-', ']', and, at the beginning of the class, '^'.

    The regular expressions listed above are grouped according to precedence,
    from highest precedence at the top to lowest at the bottom. Those grouped
    together have equal precedence. For example,

    cat|dog*

    is the same as

    (cat)|(do(g*))

    because the '*' operator has higher precedence than concatenation, and
    concatenation higher than alternation ('|').Therefore, this pattern
    matches either the string "cat" or the string "do" followed by zero or
    more instances of g. To match "cat" or zero or more instances of "dog",
    use:

    cat|(dog)*

    and to match zero or more instances of "cat" or "dog" use:

    (cat|dog)*

    Some notes on patterns:

    *     A negated character class such as the example "[^A-Z]" above will
          match a newline unless "\n" (or an equivalent escape sequence) is
          one of the characters explicitly present in the negated character
          class (for example: "[^A-Z\n]"). This is different from the way many
          other regular expression tools treat negated character classes.
          Unfortunately, however, the inconsistency is historically
          entrenched. Matching newlines means that a pattern like [^"]* can
          match the entire input unless there is another quote in the input.
    *     A rule can have no more than one instance of trailing context (the
          '/' operator or the '$' operator). The start condition, '^', and
          "<>" patterns can only occur at the beginning of a pattern,
          and, like '/' and '$', cannot be grouped inside parentheses. A '^'
          that does not occur at the beginning of a rule loses its special
          properties and is treated as a normal character. This is also true
          of a '$' that does not occur at the end of a rule.
    *     You cannot use the following:
          cat/dog$
          catdog
          You can, however, write the first of these as:
          cat/dog\n
    *     The following will result in '$' or '^' being treated as a normal
          character:
          cat|(dog$)
          cat|^dog
          If you want a "cat", or you want a dog followed by a newline, the
          following could be used (the special '|' action is explained below):
          cat      |
          dog$     /* action goes here */
          A similar trick will work for matching a cat or matching a dog at
          the beginning of a line.

  HOW INPUT IS MATCHED

    When the generated scanner is run, it analyzes its input looking for
    strings that match any of its patterns. If it finds more than one match,
    it takes the one matching the most text (for trailing-context rules, this
    includes the length of the trailing part, even though it will then be
    returned to the input). If it finds two or more matches of the same
    length, the rule listed first in the lex(1) input file is chosen.

    Once the match is determined, the text corresponding to the match (called
    the I token ) is made available in the global-character pointer yytext and
    its length in the global integer yyleng. The action corresponding to the
    matched pattern is then executed (a more detailed description of actions
    follows), and then the remaining input is scanned for another match.

    If no match is found, the default rule is executed: the next character in
    the input is considered matched and copied to the standard output. Thus,
    the simplest correct lex(1) input is:

    %%

    which generates a scanner that simply copies its input (one character at a
    time) to its output.

    Note that yytext can be defined in two ways: either as a character pointer
    or as a character array. You can control which definition lex(1) uses by
    including one of the special directives %pointer or %array in the first
    (definitions) section of your lex input. The default is %pointer, unless
    you use the -l AT&T lex compatibility option, in which case yytext will be
    an array. The advantage of using %pointer is that it provides
    substantially faster scanning and no buffer overflow when matching very
    large tokens (unless you run out of dynamic memory). The disadvantage is
    that you are restricted in how your actions can modify yytext (see the
    next section), and calls to the input() and unput() functions destroy the
    present contents of yytext. This can make moving between different lex(1)
    versions difficult.

    The advantage of %array is that you can modify yytext as much as you want,
    and calls to input() and unput() do not destroy yytext. Furthermore,
    existing AT&T lex(1) programs sometimes access yytext externally using
    declarations of the form:

    extern char yytext[];

    This definition is erroneous when used with %pointer, but correct for
    %array.

    %array defines yytext to be an array of YYLMAX characters, which defaults
    to a fairly large value. You can change the size by simply defining YYLMAX
    to a different value in the first section of your lex(1) input. As
    mentioned above, with %pointer yytext grows dynamically to accommodate
    large tokens. Although your %pointer scanner can accommodate very large
    tokens (such as matching entire blocks of comments), each time the scanner
    must resize yytext, it also must rescan the entire token from the
    beginning, so matching such tokens can prove slow. AT present, yytext does
    not dynamically grow if a call to unput() results in too much text being
    pushed back. Instead, a run-time error results.

  ACTIONS

    Each pattern in a rule has a corresponding action, which can be any
    arbitrary C statement. The pattern ends at the first non-escaped white-
    space character. The remainder of the line is its action. If the action is
    empty, when the pattern is matched, the input token is simply discarded.
    For example, here is the specification for a program that deletes all
    occurrences of "delete this string" from its input:

    %%
    "delete this string"

    (It will copy all other characters in the input to the output since they
    will be matched by the default rule.)

    The following program compresses multiple blanks and tabs down to a single
    blank, and throws away whites pace found at the end of a line:

    %%
    [ \t]+        putchar( ' ' );
    [ \t]+$       /* ignore this token */

    If the action contains a '{', the action spans the point up to which the
    balancing '}' is found; the action can cross multiple lines. The lex(1)
    utility knows about C strings and comments and will not be mislead by
    braces found within them. It allows actions to begin with %{ and will
    consider the action to be all the text up to the next %} (regardless of
    ordinary braces inside the action).

    An action consisting solely of a vertical bar ('|') means "same as the
    action for the next rule."

    Actions can include arbitrary C code, including return statements to
    return a value to the routine that called yylex(). Each time yylex() is
    called it continues processing tokens from the point at which it last left
    off until it either reaches the end of the file or executes a return.

    Actions can modify yytext, except for lengthening it (adding characters to
    its end--these will overwrite later characters in the input stream).
    Modifying the final character of yytext might alter whether rules anchored
    with '^' are active when scanning resumes. Specifically, changing the
    final character of yytext to a newline will activate such rules on the
    next scan, and changing it to anything else will deactivate the rules.
    Users should not rely on this behavior being present in future releases.
    Finally, note that none of this paragraph applies when using %array.

    Actions can to modify yyleng except they should not do so if the action
    also includes use of yymore().

    There are a number of special directives that can be included within an
    action:

    ECHO
        Copies yytext to the scanner's output.

    BEGIN
        Followed by the name of a start condition places the scanner in the
        corresponding start condition (see later in this topic).

    REJECT
        Directs the scanner to proceed on to the "second best" rule that
        matched the input (or a prefix of the input). The rule is chosen as
        described above in "How input is Matched"; yytext and yyleng set up
        appropriately. It can either be one which that as much text as the
        originally chosen rule but came later in the lex(1) input file, or one
        that matched less text. For example, the following will count the
        words in the input and call the routine special() whenever "frob" is
        seen:
            int word_count = 0;
        %%
        frob        special(); REJECT;
        [^ \t\n]+   ++word_count;
        Without the REJECT, any instances of "frob" in the input would not be
        counted as words, since the scanner normally executes only one action
        per token. Multiple occurrences of REJECT are allowed, each one
        finding the next best choice to the currently active rule. For
        example, when the following scanner scans the token "abcd", it will
        write "abcdabcaba" to the output:
        %%
        a        |
        ab       |
        abc      |
        abcd     ECHO; REJECT;
        .|\n     /* consume any unmatched character */
        (The first three rules share the action of the fourth because they use
        the special '|' action.) REJECT is a particularly expensive feature in
        terms scanner performance; if it is used in any of the scanner's
        actions it will slow down all of the scanner's matching. Furthermore,
        REJECT cannot be used with the -Cf or -CF options (discussed later in
        this topic).
        Note also that unlike the other special actions, REJECT is a branch;
        code immediately following it in the action will not be executed.
    yymore()
        Tells the scanner that the next time it matches a rule, the
        corresponding token should be appended onto the current value of
        yytext rather than replacing it. For example, given the input "hot-
        sun" the following will write "hot-hot-sun" to the output:
        %%
        hot-   ECHO; yymore();
        sun   ECHO;
        First "hot-" is matched and echoed to the output. Then "sun" is
        matched, but the previous "hot-" is still hanging around at the
        beginning of yytext so the ECHO for the "sun" rule will actually write
        "hot-sun". The presence of yymore() in the scanner's action entails a
        minor performance penalty in the scanner's matching speed.
    yyless(n)
        Returns all but the first n characters of the current token back to
        the input stream, where they will be rescanned when the scanner looks
        for the next match. yytext and yyleng are adjusted appropriately (for
        example, yyleng will now be equal to n ). On the input "catdog" the
        following will write out "catdogdog":
        %%
        catdog    ECHO; yyless(3);
        [a-z]+    ECHO;
        An argument of 0 to yyless() will cause the entire current input
        string to be scanned again. Unless you have changed how the scanner
        will subsequently process its input (using BEGIN, for example), this
        will result in an endless loop.
        Note that yyless is a macro and can only be used in the lex input
        file, not from other source files.
    unput(c)
        Puts the character c back onto the input stream. It will be the next
        character scanned. The following action will take the current token
        and cause it to be rescanned enclosed in parentheses.
        {
        int i;
        unput( ')' );
        for ( i = yyleng - 1; i >= 0; --i )
            unput( yytext[i] );
        unput( '(' );
        }
        Note that since each unput() puts the given character back at the
        beginning of the input stream, pushing back strings must be done back-
        to-front. Also note that you cannot put back EOF to attempt to mark
        the input stream with an end-of-file.
    input()
        Reads the next character from the input stream. For example, the
        following is one way to consume C comments:
        %%
        "/*"        {
                    register int c;
                    for ( ; ; )
                        {
                        while ( (c = input()) != '*' &&
                                c != EOF )
                            ;    /* consume text of comment */
                    if ( c == '*' )
                            {
                            while ( (c = input()) == '*' )
                                ;
                            if ( c == '/' )
                                break;    /* found the end */
                            }
                    if ( c == EOF )
                            {
                            error( "EOF in comment" );
                            break;
                            }
                        }
                    }
        (Note that if the scanner is compiled using C++ then input() is
        instead referred to as yyinput(), in order to avoid a name clash with
        the C++ stream by the name of input
    yyterminate()
        Can be used instead of a return statement in an action. It terminates
        the scanner and returns a 0 to the scanner's caller, indicating "all
        done". By default, yyterminate() is also called when an end-of-file is
        encountered. It is a macro and can be redefined.

  THE GENERATED SCANNER

    The output of lex(1) is the file lex.yy.c, which contains the scanning
    routine yylex(), a number of tables used by it for matching tokens, and a
    number of auxiliary routines and macros. By default, yylex() is declared
    as follows:

    int yylex()
        {
        ... various definitions and the actions are placed here ...
        }

    (If your environment supports function prototypes, it will be "int yylex(
    void )".) This definition can be changed by defining the "YY_DECL" macro.
    For example, you could use:

    #define YY_DECL float lexscan( a, b ) float a, b;

    to give the scanning routine the name lexscan(1), returning a float, and
    taking two floats as arguments. Note that if you give arguments to the
    scanning routine using a K&R-style, non-prototyped function declaration,
    you must terminate the definition with a semicolon (;).

    Whenever yylex() is called, it scans tokens from the global input file
    yyin (which defaults to stdin). It continues until it either reaches an
    end-of-file (at which point it returns the value 0) or one of its actions
    executes a return statement.

    If the scanner reaches an end-of-file, subsequent calls are undefined
    unless either yyin is pointed at a new input file (in which case scanning
    continues from that file), or yyrestart() is called. yyrestart() takes one
    argument, a FILE * pointer, and initializes yyin for scanning from that
    file. Essentially there is no difference between just assigning yyin to a
    new input file or using yyrestart() to do so; the latter is available for
    compatibility with previous versions of lex,(1) and because it can be used
    to switch input files in the middle of scanning. It can also be used to
    throw away the current input buffer, by calling it with an argument of
    yyin.

    If yylex() stops scanning due to executing a return statement in one of
    the actions, the scanner can be called again, and it will resume scanning
    where it left off.

    By default (and for purposes of efficiency), the scanner uses block-reads
    rather than simple getc(3) calls to read characters from yyin. The nature
    of how it gets its input can be controlled by defining the YY_INPUT macro.
    YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its action
    is to place up to max_size characters in the character array buf and
    return in the integer variable result either the number of characters read
    or the constant YY_NULL (traditionally 0) to indicate EOF. The default
    YY_INPUT reads from the global file pointer "yyin".

    A sample definition of YY_INPUT (in the definitions section of the input
    file):

    %{
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}

    This definition will change the input processing to occur one character at
    a time.

    You can also use this approach to add things, like keeping track of the
    input line number. If you do so, however, your scanner might not go very
    fast.

    When the scanner receives an end-of-file indication from YY_INPUT, it
    checks the yywrap() function. If yywrap() returns false (zero), it is
    assumed that the function has set up yyin to point to another input file,
    and scanning continues. If it returns true (non-zero), the scanner
    terminates, returning 0 to its caller.

    The default yywrap() always returns 1.

    The scanner writes its ECHO output to the yyout global (default, stdout),
    which may be redefined by the user simply by assigning it to some other
    FILE pointer.

  START CONDITIONS

    The lex(1) utility provides a mechanism for conditionally activating
    rules. Any rule whose pattern is prefixed with "" will be active only
    when the scanner is in the start condition named "sc". For example,

    [^"]*        { /* consume the string body ... */
                ...
                }

    will be active only when the scanner is in the "STRING" start condition,
    and

    \.        { /* handle an escape ... */
                ...
                }

    will be active only when the current start condition is either "INITIAL",
    "STRING", or "QUOTE".

    Start conditions are declared in the definitions (first) section of the
    input using unindented lines beginning with either %s or %x followed by a
    list of names. The former declares inclusive start conditions, the latter
    exclusive start conditions. A start condition is activated using the BEGIN
    action. Until the next BEGIN action is executed, rules with the given
    start condition will be active, and rules with other start conditions will
    be inactive. If the start condition is inclusive, rules with no start
    conditions will also be active. If it is exclusive, only rules qualified
    with the start condition will be active. A set of rules contingent on the
    same exclusive start condition describe a scanner that is independent of
    any of the other rules in the lex(1) input. Because of this, exclusive
    start conditions make it easy to specify "mini-scanners" which scan
    portions of the input that are syntactically different from the rest (such
    as comments).

    The following example illustrates the connection between inclusive and
    exclusive start conditions. The set of rules:

    %s example
    %%
    cat           /* do something */

    is equivalent to

    %x example
    %%
    cat   /* do something */

    Also note that the special start-condition specifier <*> matches every
    start condition. Thus, the above example could also have been written;

    %x example
    %%
    <*>cat   /* do something */

    The default rule (to ECHO any unmatched character) remains active in start
    conditions.

    BEGIN(0) returns to the original state where only rules with no start
    conditions are active. This state can also be referred to as the start-
    condition "INITIAL", so BEGIN(INITIAL) is equivalent to BEGIN(0). (The
    parentheses around the start condition name are not required but are
    considered good style.)

    BEGIN actions can also be given as indented code at the beginning of the
    rules section. For example, the following will cause the scanner to enter
    the "SPECIAL" start condition whenever yylex() is called and the global
    variable

    enter_special

    is true:

        int enter_special;
    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);
    something here
    ...more rules follow...

    The following scanner illustrates the uses of start conditions. It gives
    two different interpretations of a string like "123.456". By default, it
    will treat it as as three tokens, the integer "123", a dot ('.'), and the
    integer "456". But if the string is preceded earlier in the line by the
    string "expect-floats" it will treat it as a single token, the floating-
    point number 123.456:

    %{
    #include 
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);
    [0-9]+"."[0-9]+      {
                printf( "found a float, = %f\n",
                        atof( yytext ) );
                }

    \n           {
                /* that's the end of the line, so
                 * we need another "expect-number"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\n",
                        atoi( yytext ) );
                }
    "."         printf( "found a dot\n" );

    The following scanner recognizes (and discards) C comments while
    maintaining a count of the current input line:

    %x comment
    %%
            int line_num = 1;
    "/*"         BEGIN(comment);
    [^*\n]*        /* consume anything that is not a '*' */
    "*"+[^*/\n]*   /* consume occurrences of '*' not followed by a '/
    ' */
    \n             ++line_num;
    "*"+"/"        BEGIN(INITIAL);

    This scanner will try to match as much text as possible with each rule. In
    general, when attempting to write a high-speed scanner, it is advisable to
    try to match as much possible in each rule.

    Note that start-condition names are really integer values and can be
    stored as such. Thus, the above could be extended in the following
    fashion:

    %x comment cat
    %%
            int line_num = 1;
            int comment_caller;
    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }
    ...
    "/*"    {
                 comment_caller = cat;
                 BEGIN(comment);
                 }
    [^*\n]*        /* consume anything that is not a '*' */
    "*"+[^*/\n]*   /* consume '*'s not followed by '/'s */
    \n             ++line_num;
    "*"+"/"        BEGIN(comment_caller);

    You can also access the current start condition using the integer-valued
    YY_START macro. For example, the above assignments to comment_caller could
    instead be written

    comment_caller = YY_START;

    Note that start conditions do not have their own name space; %s and %x
    declare names the same as #define does.

    The following example illustrates how to match C-style quoted strings
    using exclusive start conditions, including expanded escape sequences (but
    not including checking for a string that is too long):

    %x str
    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;
    \"      string_buf_ptr = string_buf; BEGIN(str);

    \"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\0';
            /* return string constant token type and
             * value to parser
             */
            }

    \n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    \\[0-7]{1,3} {
            /* octal escape sequence */
            int result;
            (void) sscanf( yytext + 1, "%o", &result );
            if ( result > 0xff )
                    /* error, constant is out-of-bounds */
            *string_buf_ptr++ = result;
            }

    \\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\48' or '\0777777'
             */
            }

    \\n  *string_buf_ptr++ = '\n';
    \\t  *string_buf_ptr++ = '\t';
    \\r  *string_buf_ptr++ = '\r';
    \\b  *string_buf_ptr++ = '\b';
    \\f  *string_buf_ptr++ = '\f';

    \\(.|\n)  *string_buf_ptr++ = yytext[1];
    [^\\\n\"]+        {
            char *yytext_ptr = yytext;
            while ( *yytext_ptr )
                    *string_buf_ptr++ = *yytext_ptr++;
            }

  MULTIPLE INPUT BUFFERS

    Some scanners (such as those that support include files) require reading
    from several input streams. Because lex(1) scanners do a large amount of
    buffering, one cannot control where the next input will be read from by
    simply writing a YY_INPUT that is sensitive to the scanning context.
    YY_INPUT is only called when the scanner reaches the end of its buffer.
    This can be a long time after scanning a statement such as an "include"
    that requires switching the input source.

    To alleviate these types of problems, lex(1) provides a mechanism for
    creating and switching between multiple input buffers. An input buffer is
    created by using:

    YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )

    This takes a FILE pointer and a size and creates a buffer associated with
    the given file that is large enough to hold size characters (when in
    doubt, use YY_BUF_SIZE for the size). It returns a YY_BUFFER_STATE handle,
    which can be passed to other routines:

    void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )

    switches the scanner's input buffer so subsequent tokens will come from
    new_buffer Note that yy_switch_to_buffer() can be used by yywrap() to set
    things up for continued scanning, instead of opening a new file and
    pointing yyin at it.

    void yy_delete_buffer( YY_BUFFER_STATE buffer )

    is used to reclaim the storage associated with a buffer.

    yy_new_buffer() is an alias for yy_create_buffer(), and is provided for
    compatibility with the C++ use of new and delete for creating and
    destroying dynamic objects.

    Finally, the YY_CURRENT_BUFFER macro returns a YY_BUFFER_STATE handle to
    the current buffer.

    The following example illustrates how to use these features to write a
    scanner that expands include files (the <> feature is discussed later
    in this topic):

    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);
    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;
    [ \t]*      /*consume the white space */
    [^ \t\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply" );
                exit( 1 );
                }
            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;
            yyin = fopen( yytext, "r" );

        if ( ! yyin )
                error( ... );
            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );
            BEGIN(INITIAL);
            }

    <> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }
            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }

  END-OF-FILE RULES

    The special rule "<>" indicates actions to be taken when an end-of-
    file is encountered and yywrap() returns non-zero (that is, indicates no
    further files to process). The action must finish by doing one of four
    things:

    *     Assign yyin to a new input file (in previous versions of lex, after
          doing the assignment you had to call the special action YY_NEW_FILE;
          this is no longer necessary).
    *     Execute a return statement.
    *     Execute the special yyterminate() action.
    *     Switch to a new buffer using yy_switch_to_buffer(), as shown in the
          example above.

    <> rules cannot be used with other patterns; they can only be
    qualified with a list of start conditions. If an unqualified <> rule
    is given, it applies to all start conditions that do not already have
    <> actions. To specify an <> rule for only the initial start
    condition, use

    <>

    These rules are useful for catching things like unclosed comments, as in
    the following example:

    %x quote
    %%
    ...other rules for dealing with quotes...
    <>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                yyterminate();
             }

  MISCELLANEOUS MACROS

    The macro YY_USER_ACTION can be defined to provide an action that is
    always executed prior to the action of the matched rule. For example, it
    could be defined (using #define) to call a routine to convert yytext to
    lowercase.

    The macro YY_USER_INIT can be defined to provide an action that is always
    executed before the first scan (and before the scanner's internal
    initializations are done). For example, it could be used to call a routine
    to read in a data table or open a logging file.

    In the generated scanner, the actions are all gathered in one large switch
    statement and separated using YY_BREAK, which may be redefined. By
    default, it is simply a "break", to separate each rule's action from
    action of the following rule. Redefining YY_BREAK allows, for example, C++
    users to #define YY_BREAK to do nothing (while being very careful that
    every rule ends with a "break" or a "return") to avoid getting unreachable
    statement warnings, where, because a rule's action ends with "return", the
    YY_BREAK is inaccessible.

  INTERFACING WITH YACC

    One of the ways to use lex(1) is as a companion to the yacc(1) parser-
    generator. The yacc(1) parsers expect to call a routine named yylex() to
    find the next input token. The routine is supposed to return the type of
    the next token, as well as putting any associated value in the global
    yylval To use lex(1) with yacc(1), one specifies the -d option to yacc(1).
    This instructs yacc(1) to generate the file y.tab.h, which contains
    definitions of all the %tokens appearing in the yacc(1) input. This file
    is then included in the lex(1) scanner. For example, if one of the tokens
    is "TOK_NUMBER", part of the scanner might look like:

    %{
    #include "y.tab.h"
    %}
    %%
    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

  LEX OPTIONS

    The lex(1) utility has the following options:

    -7
        Generate a seven-bit scanner, which can save considerable table space,
        especially when using -Cf or -CF (and, at most sites, -7 is on by
        default for these options. To see if this is the case, use the -
        v verbose flag and check the flag summary it reports).

    -8
        Generate an eight-bit scanner. This is the default, except for the -Cf
        and -CF compression options, for which the default is site-dependent,
        and can be checked by inspecting the flag summary generated by the -
        v option.

    -B
        Generate a batch scanner instead of an interactive scanner (see -
        I later in this discussion). Scanners using -Cf or -CF compression
        options also specify this option automatically .

    -b
        Generate backing-up information to lex.backup(1). This provides a list
        of scanner states that require backing up, and the input characters on
        which they do so. By adding rules, one can remove backing-up states.
        If all backing-up states are eliminated and either -Cf or -CF is used,
        the generated scanner will run faster.

    -C[aefFmr]
        Specify the degree of table compression and scanner optimization.

    -Ca
        Trade off larger tables in the generated scanner for faster
        performance because the elements of the tables are better aligned for
        memory access and computation. This option can double the size of the
        tables used by your scanner.

    -Ce
        Construct equivalence; that is, sets of characters that have identical
        lexical properties. Equivalence classes usually give dramatic
        reductions in the final table/object file sizes (typically a factor of
        2-5) and have little impact on performance (one array look-up per
        character scanned).

    -Cf
        Generate full scanner tables. The lex(1) utility should not compress
        the tables by taking advantages of similar transition functions for
        different states.

    -CF
        Use the alternate fast-scanner representation (described in
        lexdoc(1)).

    -Cm
        Construct meta-equivalence classes, which are sets of equivalence
        classes (or characters, if equivalence classes are not being used)
        that are commonly used together. Meta-equivalence classes are often
        effective when using compressed tables, but they have a moderate
        impact on performance (one or two "if" tests and one array look-up per
        character scanned).

    -Cr
        Bypass using stdio for input in generated scanner. In general this
        option results in a minor performance gain only worthwhile if used in
        conjunction with -Cf or -CF. It can cause surprising behavior if you
        use stdio yourself to read from yyin prior to calling the scanner.

    -C
        Alone, compress scanner tables, but use neither equivalence classes
        nor meta-equivalence classes.

    The options -Cf or -CF and -Cm do not make sense together; there is no
    opportunity for meta-equivalence classes if the table is not being
    compressed. Otherwise, the options can be freely mixed.

    The default setting is -Cem, which specifies that lex(1) should generate
    equivalence classes and meta-equivalence classes. This setting provides
    the highest degree of table compression. You can trade off faster-
    executing scanners at the cost of larger tables with the following
    generally being true:

        Slowest and smallest
            -Cem
            -Cm
            -Ce
            -C
            -C{f,F}e
            -C{f,F}
            -C{f,F}a
        Fastest and largest

    -C options are cumulative.

    -c
        Does nothing; a deprecated option included for POSIX compliance.

    NOTE:
        In previous releases of lex(1) -c specified table-compression options.
        This functionality is now given by the -C flag. To ease the impact of
        this change, when lex(1) encounters -c, it currently issues a warning
        message and assumes that you wanted -C instead. In the future this
        "promotion" of -c to -C will be eliminated in the name of full POSIX
        compliance (unless the POSIX meaning is removed first).

    -d
        Run the generated scanner in debug mode. Whenever a pattern is
        recognized and the global yy_flex_debug is non-zero (which is the
        default), the scanner will write to stderr a line of the form:
        --accepting rule at line 53 ("the matched text")
        The line number refers to the location of the rule in the file
        defining the scanner (that is, the file that was provided to lex).
        Messages are also generated when the scanner backs up, accepts the
        default rule, reaches the end of its input buffer (or encounters a
        NUL; to the scanner's concerned, the two look identical), or reaches
        an end-of-file.

    -F
        Use fast scanner table representation (and bypass stdio). This
        representation is about as fast as the full-table representation (-f),
        and for some sets of patterns will be considerably smaller (and for
        others, larger). See lexdoc(1) for more details.
        This option is equivalent to -CFr (discussed later in this topic).

    -f
        Use fast scanner. No table compression is done and stdio is bypassed.
        The result is large but fast. This option is equivalent to -Cfr
        (discussed later in this topic).

    -h
        Generate a "help" summary of lex's(1) options to stderr and then exit.

    -I
        Generate an interactive scanner, that is, a scanner that stops
        immediately rather than looking ahead if it knows that the currently
        scanned text cannot be part of a longer rule's match. This is the
        opposite of batch scanners (see -B above). See lexdoc(1) for details.
        Note that -I cannot be used in conjunction with full or fast tables;
        that is, the -f, -F, -Cf, or -CF flags. For other table-compression
        options, -I is the default.

    -i
        Generate a case-insensitive scanner. The case of letters given in the
        lex(1) input patterns will be ignored, and tokens in the input will be
        matched regardless of case. The matched text given in yytext will have
        the preserved case (that is, it will not be folded).

    -L
        Do not generate #line directives in lex.yy.c. The default is to
        generate such directives so error messages in the actions will be
        correctly located with respect to the original lex(1) input file, and
        not to the relatively meaningless line numbers of lex.yy.c.

    -l
        Turn on maximum compatibility with the original AT&T lex
        implementation, at a considerable performance cost. This option is
        incompatible with -f, -F, -Cf, and -CF. See lexdoc(1) for details.

    -n
        Does nothing. Another deprecated option included only for POSIX
        compliance.
    -Pprefix
        Change the default yy prefix used by lex(1) to be prefix instead. See
        lexdoc(1) for a description of all the global variables and file names
        that this affects.

    -p
        Generate a performance report to stderr. The report consists of
        comments regarding features of the lex(1) input file that will cause a
        loss of performance in the resulting scanner. If you give the flag
        twice, you will also get comments regarding features that lead to
        minor performance losses.
    -Sskeleton_file

        Use skeleton_file to construct the scanner instead of the default
        file. You will never need this option unless you are performing lex(1)
        maintenance or development.

    -s
        Suppress the default rule (that unmatched scanner input is echoed to
        stdout). If the scanner encounters input that does not match any of
        its rules, it aborts with an error.

    -T
        Run in trace mode. It will generate many messages to stderr concerning
        the form of the input and the resultant non-deterministic and
        deterministic finite automata. This option is mostly for use in
        maintaining lex(1).

    -t
        Write the scanner it generates to standard output instead of lex.yy.c.

    -V
        Print the version number to stderr and exits.

    -v
        Write to stderr a summary of statistics regarding the scanner it
        generates.

    -w
        Suppress warning messages.

  PERFORMANCE CONSIDERATIONS

    The main design goal of lex(1) is to generate high-performance scanners.
    It has been optimized for dealing well with large sets of rules. Aside
    from the effects on scanner speed of the table-compression -C options
    outlined above, there are a number of options and actions that degrade
    performance. These are provided below, from most to lease expensive, (with
    the first three having considerable impact on performance, and the last
    two having little impact on performance):

    *     REJECT
    *     Pattern sets that require backing up
    *     Arbitrary trailing context
    *     yymore()
    *     The ^ beginning-of-line operator

    Note also that unput() is implemented as a routine call that potentially
    does quite a bit of work, while the yyless() macro has little impact on
    performance; so if you are simply putting back some excess text you
    scanned, use yyless().

    Avoid REJECT at all costs when performance is important. It is a
    particularly expensive option.

    Eliminating backing up is difficult and can be an enormous amount of work
    for a complicated scanner. In principal, one begins by using the -b flag
    to generate a lex.backup file. For example, on the input

    %%
    cat        return TOK_KEYWORD;
    catdog     return TOK_KEYWORD;

    the file appears as follows:

    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]
    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-'  b-\177 ]
    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]
    Compressed tables always back up.

    The first few lines indicate that a scanner state exists in which it can
    make a transition on an 'a' but not on any other character. These lines
    also indicate that, in that state, the currently scanned text does not
    match any rule. The state occurs when trying to match the rules found at
    lines 2 and 3 in the input file. If the scanner is in that state and then
    reads something other than an 'a', it will have to back up to find a rule
    that is matched. It becomes apparent that this must be the state the
    scanner is in when it has seen "ca". When this has happened, if anything
    other than another 'a' is seen, the scanner will have to back up to simply
    match the 'c' (by the default rule).

    The comment regarding State #8 indicates that there is a problem when
    "catd" has been scanned. Indeed, on any character other than an 'o', the
    scanner will have to back up to accept "cat". Similarly, the comment for
    State #9 concerns when "catdo" has been scanned and a 'g' does not follow.

    The final comment is a reminder that it is not useful to remove backing up
    from the rules unless -Cf or -CF is being used, since there is no gain in
    performance doing so with compressed scanners.

    The way to remove the backing up is to add "error" rules:

    %%
    cat         return TOK_KEYWORD;
    catdog      return TOK_KEYWORD;
    catdo       |
    catd        |
    ca         {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }

    Eliminating backing up among a list of keywords can also be done using a
    "catch-all" rule:

    %%
    cat         return TOK_KEYWORD;
    catdog     return TOK_KEYWORD;
    [a-z]+      return TOK_ID;

    This is usually the best solution when appropriate.

    Backing up messages tend to cascade. With a complicated set of rules, it
    is not uncommon to get hundreds of messages. If one can decipher them,
    however, it often only takes about a dozen rules to eliminate the backing
    up. (It is easy, however, to make a mistake and have an error rule
    accidentally match a valid token. A possible future lex(1) feature might
    automatically add rules to eliminate backing up).

    Variable trailing context (where both the leading and trailing parts do
    not have a fixed length) entails almost the same performance loss as
    REJECT (that is, substantial). So when possible a rule like:

    %%
    mouse|rat/(cat|dog)   run();

    is better written:

    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();

    or as

    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();

    Note that here the special '|' action does not have a positive impact on
    performance, and could even have a negative impact.

    A final note regarding performance: dynamically resizing yytext to
    accommodate huge tokens is a slow process because it presently requires
    that the (huge) token be rescanned from the beginning. If performance is
    vital, you should attempt to match "large" quantities of text but not
    "huge" quantities, where the cutoff between the two is at about 8 KB
    characters/token.

    Another area where you can increase a scanner's performance (and one that
    is easier to implement) arises from the fact that the longer the tokens
    that are matched, the faster the scanner will run. This is because with
    long tokens the processing of most input characters takes place in the
    (short) inner scanning loop, and does not often have to go through the
    additional work of setting up the scanning environment (such as yytext for
    the action. Recall the scanner for C comments:

    %x comment
    %%
            int line_num = 1;
    "/*"         BEGIN(comment);
    [^*\n]*
    "*"+[^*/\n]*
    \n             ++line_num;
    "*"+"/"        BEGIN(INITIAL);

    This could be sped up by writing it as:

    %x comment
    %%
            int line_num = 1;
    "/*"         BEGIN(comment);
    [^*\n]*
    [^*\n]*\n      ++line_num;
    "*"+[^*/\n]*
    "*"+[^*/\n]*\n ++line_num;
    "*"+"/"        BEGIN(INITIAL);

    Instead of each newline requiring the processing of another action, the
    recognition of the newlines is "distributed" over the other rules to keep
    the matched text as long as possible. Note that adding rules does not slow
    down the scanner. The speed of the scanner is independent of the number of
    rules or (modulo the considerations given at the beginning of this
    section) how complicated the rules are with regard to operators such as
    '*' and '|'.

    The following is a final example for speeding up a scanner. It presumes
    that you want to scan through a file that contains identifiers and
    keywords, one per line and with no other extraneous characters,
    recognizing all the keywords. The example provides a logical first
    approach:

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it is a keyword */
    .|\n     /* it is not a keyword */

    To eliminate the back-tracking, introduce a catch-all rule:

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it is a keyword */
    [a-z]+   |
    .|\n     /* it is not a keyword */

    If it is certain that there is exactly one word per line, the total number
    of matches can be reduced by half by merging in the recognition of
    newlines with that of the other tokens:

    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it is a keyword */
    [a-z]+\n |
    .|\n     /* it is not a keyword */

    Caution should be used as backing up into the scanner has been
    reintroduced. In particular, while you might know that there will never be
    any characters in the input stream other than letters or newlines, lex(1)
    cannot know this; it will plan for possibly needing to back up when it has
    scanned a token like "auto" and the next character is something other than
    a newline or a letter. Previously, it would then just match the "auto"
    rule and be done, but now it has no "auto" rule, only a "auto\n" rule. To
    eliminate the possibility of backing up, you could either duplicate all
    rules without final newlines, or, because you never expect to encounter
    such input, and therefore do not know how it is classified, you could
    introduce one more catch-all rule, this one, which does not include a
    newline:

    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it is a keyword */
    [a-z]+\n |
    [a-z]+   |
    .|\n     /* it is not a keyword */

    Compiled with -Cf, this is about as fast as one can get a lex(1) scanner
    to go for this particular problem.

    A final note: lex(1) is slow when matching NULs, particularly when a token
    contains multiple NULs. It is best to write rules that match short amounts
    of text if you anticipate that the text will often include NULs.

  INCOMPATIBILITIES WITH AT&T LEX AND POSIX

    This lex(1) is a rewrite of the AT&T lex(1) tool (the two implementations
    do not share any code, however), with some extensions and
    incompatibilities, both of which are of concern to those who want to write
    scanners acceptable to either implementation. The POSIX lex(1)
    specification is closer to the behavior of this implementation of lex(1)
    than to the original AT&T lex(1) implementation. Some incompatibilities
    remain, however, between this lex(1) and POSIX. The intent is that,
    ultimately, this lex(1) will be fully POSIX-conformant. This section
    discusses all of the known areas of incompatibility.

    The lex(1) -l option turns on maximum compatibility with the original AT&T
    lex(1) implementation, at the cost of a major loss in the generated
    scanner's performance. The incompatibilities that can be overcome using
    the -l option are discussed later in this section.

    The lex(1) utility is fully compatible with AT&T lex(1) with the following
    exceptions:

    *     The undocumented AT&T lex(1) scanner internal variable yylineno is
          not supported unless -l is used.
    *     The yylineno variable is not part of the POSIX specification.
    *     The input() routine is not redefinable, though it can be called to
          read characters following whatever has been matched by a rule. If
          input() encounters an end-of-file, the normal yywrap() processing is
          done. A "real" end-of-file is returned by input() as EOF.
    *     Input is controlled instead by defining the YY_INPUT macro.
    *     The lex(1) restriction that input() cannot be redefined is in
          accordance with the POSIX specification, which simply does not
          specify any way of controlling the scanner's input other than by
          making an initial assignment to yyin
    *     These lex(1) scanners are not as reentrant as AT&T lex(1) scanners.
          In particular, if you have an interactive scanner, and an interrupt
          handler that long-jumps out of the scanner, and the scanner is
          subsequently called again, you might get the following message:

    fatal lex scanner internal error--end of buffer missed

    To reenter the scanner, first use

    yyrestart( yyin );

    Note that this call will discard any buffered input; usually this is not a
    problem with an interactive scanner.

    The output() function is not supported. Output from the ECHO macro is done
    to the file-pointer yyout (default stdout).

    output() is not part of the POSIX specification.

    AT&T lex(1) does not support exclusive start conditions (%x) even though
    they are in the POSIX specification.

    When definitions are expanded, lex(1) encloses them in parentheses. With
    AT&T lex, the following:

    NAME    [A-Z][A-Z0-9]*
    %%
    cat{NAME}?      printf( "Found it\n" );
    %%

    will not match the string "cat" because when the macro is expanded the
    rule is equivalent to "cat[A-Z][A-Z0-9]*?" and the precedence is such that
    the '?' is associated with "[A-Z0-9]*". With this lex(1), the rule will be
    expanded to "cat([A-Z][A-Z0-9]*)?" and so the string "cat" will match.

    Note that if the definition begins with ^ or ends with $, it is not
    expanded with parentheses, which allows these operators to appear in
    definitions without losing their special meanings. But the , /, and
    <> operators cannot be used in a lex(1) definition.

    Using -l results in the AT&T lex(1) behavior; there are no parentheses
    around the definition.

    The POSIX specification is that the definition be enclosed in parentheses.

    The AT&T lex(1) %r (generate a Ratfor scanner) option is not supported. It
    is not part of the POSIX specification. After a call to unput(), yytext
    and yyleng are undefined until the next token is matched, unless the
    scanner was built using %array. This is not the case with AT&T lex(1) or
    the POSIX specification. The -l option eliminates with this
    incompatibility.

    The precedence of the {} (numeric range) operator is different. AT&T
    lex(1) interprets "abc{1,3}" as "match one, two, or three occurrences of
    'abc'"; whereas lex(1) interprets it as "match 'ab' followed by one, two,
    or three occurrences of 'c'". The latter is in agreement with the POSIX
    specification.

    The precedence of the ^ operator is different. AT&T lex(1) interprets
    "^cat|dog" as "match either 'cat' at the beginning of a line, or 'dog'
    anywhere"; whereas lex(1) interprets it as "match either 'cat' or 'dog' if
    they come at the beginning of a line". The latter is in agreement with the
    POSIX specification.

    AT&T lex(1) initializes yyin to be stdin; lex,(1), on the other hand,
    initializes yyin to NULL, and then assigns it to stdin the first time the
    scanner is called, providing yyin has not already been assigned to a non-
    NULL value. The difference is subtle, but the effect is that with lex(1)
    scanners, yyin does not have a valid value until the scanner has been
    called.

    The -l option eliminates this incompatibility.

    The special table-size declarations such as %a supported by AT&T and POSIX
    lex(1) are not required by lex(1) scanners; lex(1) ignores them.

    The name FLEX_SCANNER is defined (by #define) so that scanners can be
    written for use with either this lex(1) (which is actually a version of
    the flex scanner-generator) or AT&T lex(1).

    The following lex(1) features are not included in AT&T lex(1) or the POSIX
    specification:

    *     yyterminate()
    *     <>
    *     <*>
    *     YY_DECL
    *     YY_START
    *     YY_USER_ACTION
    *     #line directives
    *     The placement of %{} around actions
    *     Multiple actions on a line
    *     Almost all of the lex(1) flags

    The next-to-last feature in the list refers to the fact that, with lex(1),
    you can put multiple actions on the same line, separated with semicolons,
    while with AT&T lex,(1) the following

    cat    handle_cat(); ++num_cats_seen;

    is (rather surprisingly) truncated to

    cat   handle_cat();

    The lex(1) utility does not truncate the action. Actions that are not
    enclosed in braces are simply terminated at the end of the line.

  DIAGNOSTICS

    The lex(1) utility produces the following diagnostic messages:

    warning, rule cannot be matched
        Indicates that the given rule cannot be matched because it follows
        other rules that will always match the same text as it. For example,
        in the following "dog" cannot be matched because it comes after an
        identifier "catch-all" rule:
        [a-z]+    got_identifier();
        dog       got_dog();
        Using REJECT in a scanner suppresses this warning.

    warning, -s option given but default rule can be matched
        It is possible (perhaps only in a particular start condition) that the
        default rule (match any single character) is the only one that will
        match a particular input. Since -s was given, presumably this is not
        intended.

    reject_used_but_not_detected undefined
        This is a compile-time error; the scanner uses REJECT but lex(1) did
        not find it in the first two sections. Usually, this happens because
        the reference is implicit (through an #include file, for example).
        Make an explicit reference to the action in your lex(1) input file.
        (Note that previously lex(1) supported a %used mechanism for dealing
        with this problem. Although this feature is still supported, it is now
        deprecated.)

    yymore_used_but_not_detected undefined
        This is a compile-time error; the scanner uses yymore(3) but lex(1)
        did not find it in the first two sections. Usually this happens
        because the reference is implicit (through an #include file, for
        example). Make an explicit reference to the action in your lex(1)
        input file. (Note that previously lex(1) supported a %used mechanism
        for handling this problem. Although this feature is still supported,
        it is now deprecated.)

    lex scanner jammed
        A scanner compiled with -s has encountered an input string which was
        not matched by any of its rules. This error can also occur due to
        internal problems.

    token too large, exceeds YYLMAX
        Your scanner uses %array, and one of its rules matched a string longer
        than the YYLMAX constant (eight KB bytes by default). You can increase
        the value by defining (using #define) YYLMAX in the definitions
        section of your lex(1) input.

    scanner requires -8 flag to use the character 'x'
        Your scanner specification includes recognizing the eight-bit
        character 'x', you did not specify the -8 flag, and your scanner
        defaulted to seven-bit because you used the -Cf or -CF table-
        compression options. See the discussion of the -7 flag for details.

    lex scanner push-back overflow
        You used unput() to push back so much text that the scanner's buffer
        could not hold both the pushed-back text and the current token in
        yytext Ideally, the scanner should dynamically resize the buffer in
        this case; at present, however, it does not.

    input buffer overflow, can't enlarge buffer because scanner uses REJECT
        The scanner was working on matching an extremely large token and
        needed to expand the input buffer. This does not work with scanners
        that use REJECT.

    fatal lex scanner internal error--end of buffer missed
        This can occur in an scanner that is reentered after a long-jump has
        jumped out (or over) the scanner's activation frame. Before reentering
        the scanner, use:
        yyrestart( yyin );

    too many start conditions in <> construct!
        You listed more start conditions in a <> construct than exist (you
        must have listed at least one of them twice).

  FILES

    See lex(1).

  DEFICIENCIES / BUGS

    See lex(1).

  AUTHOR

    Vern Paxson, with the help of many ideas and much inspiration from Van
    Jacobson. Original version by Jef Poskanzer. The fast table representation
    is a partial implementation of a design done by Van Jacobson. The
    implementation was done by Kevin Gong and Vern Paxson.

    Thanks to the many lex(1) beta-testers, feedbackers, and contributors,
    especially Francois Pinard, Casey Leedom, Nelson H.F. Beebe,
    benson@odi.com, Peter A. Bigot, Keith Bostic, Frederic Brehm, Nick
    Christopher, Jason Coughlin, Bill Cox, Dave Curtis, Scott David Daniels,
    Chris G. Demetriou, Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
    Chris Faylor, Jon Forrest, Kaveh R. Ghazi, Eric Goldman, Ulrich Grepel,
    Jan Hajic, Jarkko Hietaniemi, Eric Hughes, John Interrante, Ceriel Jacobs,
    Jeffrey R. Jones, Henry Juengst, Amir Katz, ken@ken.hilco.com, Kevin B.
    Kenny, Marq Kole, Ronald Lamprecht, Greg Lee, Craig Leres, John Levine,
    Steve Liddle, Mohamed el Lozy, Brian Madsen, Chris Metcalf, Luke Mewburn,
    Jim Meyering, G.T. Nicol, Landon Noll, Marc Nozell, Richard Ohnemus, Sven
    Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef
    Poskanzer, Joe Rahmeh, Frederic Raimbault, Rick Richardson, Kevin Rodgers,
    Jim Roskind, Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex
    Siegel, Mike Stump, Paul Stuart, Dave Tallman, Chris Thewalt, Paul
    Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
    Yap, Nathan Zelle, David Zuhn, and those whose names have slipped my
    marginal mail-archiving skills but whose contributions are appreciated all
    the same.

    Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore, Craig
    Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard, Rich Salz,
    and Richard Stallman for help with various distribution headaches.

    Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
    Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
    Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
    Eric Hughes for support of multiple buffers.

    This work was primarily done when I was with the Real Time Systems Group
    at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all
    there for the support I received.

    Send comments to:

    Vern Paxson Systems Engineering Bldg. 46A, Room 1123 Lawrence Berkeley
    Laboratory University of California Berkeley, CA 94720 vern@ee.lbl.gov

  SEE ALSO

    lex(1)

    yacc(1)

    sed(1)

    awk(1)

    M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator


Interix / SUAHosted at SUA Community for Interix, SUA and SFUInterix / SUA