1 .TH FLEX 1 "26 May 1990" "Version 2.3"
3 flex - fast lexical analyzer generator
6 .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
10 is a tool for generating
12 programs which recognized lexical patterns in text.
15 the given input files, or its standard input if no file names are given,
16 for a description of a scanner to generate. The description is in
18 of regular expressions and C code, called
20 generates as output a C source file,
22 which defines a routine
24 This file is compiled and linked with the
26 library to produce an executable. When the executable is run,
27 it analyzes its input for occurrences
28 of the regular expressions. Whenever it finds one, it executes
29 the corresponding C code.
30 .SH SOME SIMPLE EXAMPLES
32 First some simple examples to get the flavor of how one uses
36 input specifies a scanner which whenever it encounters the string
37 "username" will replace it with the user's login name:
41 username printf( "%s", getlogin() );
44 By default, any text not matched by a
47 is copied to the output, so the net effect of this scanner is
48 to copy its input file to its output with each occurrence
49 of "username" expanded.
50 In this input, there is just one rule. "username" is the
52 and the "printf" is the
54 The "%%" marks the beginning of the rules.
56 Here's another simple example:
59 int num_lines = 0, num_chars = 0;
62 \\n ++num_lines; ++num_chars;
69 printf( "# of lines = %d, # of chars = %d\\n",
70 num_lines, num_chars );
74 This scanner counts the number of characters and the number
75 of lines in its input (it produces no output other than the
76 final report on the counts). The first line
77 declares two globals, "num_lines" and "num_chars", which are accessible
82 routine declared after the second "%%". There are two rules, one
83 which matches a newline ("\\n") and increments both the line count and
84 the character count, and one which matches any character other than
85 a newline (indicated by the "." regular expression).
87 A somewhat more complicated example:
90 /* scanner for a toy Pascal-like language */
93 /* need this for the call to atof() below */
103 printf( "An integer: %s (%d)\\n", yytext,
107 {DIGIT}+"."{DIGIT}* {
108 printf( "A float: %s (%g)\\n", yytext,
112 if|then|begin|end|procedure|function {
113 printf( "A keyword: %s\\n", yytext );
116 {ID} printf( "An identifier: %s\\n", yytext );
118 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
120 "{"[^}\\n]*"}" /* eat up one-line comments */
122 [ \\t\\n]+ /* eat up whitespace */
124 . printf( "Unrecognized character: %s\\n", yytext );
132 ++argv, --argc; /* skip over program name */
134 yyin = fopen( argv[0], "r" );
142 This is the beginnings of a simple scanner for a language like
143 Pascal. It identifies different types of
145 and reports on what it has seen.
147 The details of this example will be explained in the following
149 .SH FORMAT OF THE INPUT FILE
152 input file consists of three sections, separated by a line with just
166 section contains declarations of simple
168 definitions to simplify the scanner specification, and declarations of
170 which are explained in a later section.
172 Name definitions have the form:
178 The "name" is a word beginning with a letter or an underscore ('_')
179 followed by zero or more letters, digits, '_', or '-' (dash).
180 The definition is taken to begin at the first non-white-space character
181 following the name and continuing to the end of the line.
182 The definition can subsequently be referred to using "{name}", which
183 will expand to "(definition)". For example,
190 defines "DIGIT" to be a regular expression which matches a
192 "ID" to be a regular expression which matches a letter
193 followed by zero-or-more letters-or-digits.
194 A subsequent reference to
206 and matches one-or-more digits followed by a '.' followed
207 by zero-or-more digits.
213 input contains a series of rules of the form:
219 where the pattern must be unindented and the action must begin
222 See below for a further description of patterns and actions.
224 Finally, the user code section is simply copied to
227 It is used for companion routines which call or are called
228 by the scanner. The presence of this section is optional;
229 if it is missing, the second
231 in the input file may be skipped, too.
233 In the definitions and rules sections, any
235 text or text enclosed in
239 is copied verbatim to the output (with the %{}'s removed).
240 The %{}'s must appear unindented on lines by themselves.
242 In the rules section,
243 any indented or %{} text appearing before the
244 first rule may be used to declare variables
245 which are local to the scanning routine and (after the declarations)
246 code which is to be executed whenever the scanning routine is entered.
247 Other indented or %{} text in the rule section is still copied to the output,
248 but its meaning is not well-defined and it may well cause compile-time
249 errors (this feature is present for
251 compliance; see below for other such features).
253 In the definitions section, an unindented comment (i.e., a line
254 beginning with "/*") is also copied verbatim to the output up
255 to the next "*/". Also, any line in the definitions section
256 beginning with '#' is ignored, though this style of comment is
257 deprecated and may go away in the future.
259 The patterns in the input are written using an extended set of regular
260 expressions. These are:
263 x match the character 'x'
264 . any character except newline
265 [xyz] a "character class"; in this case, the pattern
266 matches either an 'x', a 'y', or a 'z'
267 [abj-oZ] a "character class" with a range in it; matches
268 an 'a', a 'b', any letter from 'j' through 'o',
270 [^A-Z] a "negated character class", i.e., any character
271 but those in the class. In this case, any
272 character EXCEPT an uppercase letter.
273 [^A-Z\\n] any character EXCEPT an uppercase letter or
275 r* zero or more r's, where r is any regular expression
277 r? zero or one r's (that is, "an optional r")
278 r{2,5} anywhere from two to five r's
279 r{2,} two or more r's
281 {name} the expansion of the "name" definition
284 the literal string: [xyz]"foo
285 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286 then the ANSI-C interpretation of \\x.
287 Otherwise, a literal 'X' (used to escape
288 operators such as '*')
289 \\123 the character with octal value 123
290 \\x2a the character with hexadecimal value 2a
291 (r) match an r; parentheses are used to override
292 precedence (see below)
295 rs the regular expression r followed by the
296 regular expression s; called "concatenation"
299 r|s either an r or an s
302 r/s an r but only if it is followed by an s. The
303 s is not part of the matched text. This type
304 of pattern is called as "trailing context".
305 ^r an r, but only at the beginning of a line
306 r$ an r, but only at the end of a line. Equivalent
310 <s>r an r, but only in start condition s (see
311 below for discussion of start conditions)
313 same, but in any of start conditions s1,
317 <<EOF>> an end-of-file
319 an end-of-file when in start condition s1 or s2
322 The regular expressions listed above are grouped according to
323 precedence, from highest precedence at the top to lowest at the bottom.
324 Those grouped together have equal precedence. For example,
336 since the '*' operator has higher precedence than concatenation,
337 and concatenation higher than alternation ('|'). This pattern
342 the string "ba" followed by zero-or-more r's.
343 To match "foo" or zero-or-more "bar"'s, use:
349 and to match zero-or-more "foo"'s-or-"bar"'s:
356 Some notes on patterns:
358 A negated character class such as the example "[^A-Z]"
360 .I will match a newline
361 unless "\\n" (or an equivalent escape sequence) is one of the
362 characters explicitly present in the negated character class
363 (e.g., "[^A-Z\\n]"). This is unlike how many other regular
364 expression tools treat negated character classes, but unfortunately
365 the inconsistency is historically entrenched.
366 Matching newlines means that a pattern like [^"]* can match an entire
367 input (overflowing the scanner's input buffer) unless there's another
370 A rule can have at most one instance of trailing context (the '/' operator
371 or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372 can only occur at the beginning of a pattern, and, as well as with '/' and '$',
373 cannot be grouped inside parentheses. A '^' which does not occur at
374 the beginning of a rule or a '$' which does not occur at the end of
375 a rule loses its special properties and is treated as a normal character.
377 The following are illegal:
384 Note that the first of these, can be written "foo/bar\\n".
386 The following will result in '$' or '^' being treated as a normal character:
393 If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394 could be used (the special '|' action is explained below):
398 bar$ /* action goes here */
401 A similar trick will work for matching a foo or a
402 bar-at-the-beginning-of-a-line.
403 .SH HOW THE INPUT IS MATCHED
404 When the generated scanner is run, it analyzes its input looking
405 for strings which match any of its patterns. If it finds more than
406 one match, it takes the one matching the most text (for trailing
407 context rules, this includes the length of the trailing part, even
408 though it will then be returned to the input). If it finds two
409 or more matches of the same length, the
410 rule listed first in the
412 input file is chosen.
414 Once the match is determined, the text corresponding to the match
417 is made available in the global character pointer
419 and its length in the global integer
423 corresponding to the matched pattern is then executed (a more
424 detailed description of actions follows), and then the remaining
425 input is scanned for another match.
427 If no match is found, then the
429 is executed: the next character in the input is considered matched and
430 copied to the standard output. Thus, the simplest legal
438 which generates a scanner that simply copies its input (one character
439 at a time) to its output.
441 Each pattern in a rule has a corresponding action, which can be any
442 arbitrary C statement. The pattern ends at the first non-escaped
443 whitespace character; the remainder of the line is its action. If the
444 action is empty, then when the pattern is matched the input token
445 is simply discarded. For example, here is the specification for a program
446 which deletes all occurrences of "zap me" from its input:
453 (It will copy all other characters in the input to the output since
454 they will be matched by the default rule.)
456 Here is a program which compresses multiple blanks and tabs down to
457 a single blank, and throws away whitespace found at the end of a line:
461 [ \\t]+ putchar( ' ' );
462 [ \\t]+$ /* ignore this token */
466 If the action contains a '{', then the action spans till the balancing '}'
467 is found, and the action may cross multiple lines.
469 knows about C strings and comments and won't be fooled by braces found
470 within them, but also allows actions to begin with
472 and will consider the action to be all the text up to the next
474 (regardless of ordinary braces inside the action).
476 An action consisting solely of a vertical bar ('|') means "same as
477 the action for the next rule." See below for an illustration.
479 Actions can include arbitrary C code, including
481 statements to return a value to whatever routine called
485 is called it continues processing tokens from where it last left
486 off until it either reaches
487 the end of the file or executes a return. Once it reaches an end-of-file,
488 however, then any subsequent call to
490 will simply immediately return, unless
492 is first called (see below).
494 Actions are not allowed to modify yytext or yyleng.
496 There are a number of special directives which can be included within
500 copies yytext to the scanner's output.
503 followed by the name of a start condition places the scanner in the
504 corresponding start condition (see below).
507 directs the scanner to proceed on to the "second best" rule which matched the
508 input (or a prefix of the input). The rule is chosen as described
509 above in "How the Input is Matched", and
513 set up appropriately.
514 It may either be one which matched as much text
515 as the originally chosen rule but came later in the
517 input file, or one which matched less text.
518 For example, the following will both count the
519 words in the input and call the routine special() whenever "frob" is seen:
525 frob special(); REJECT;
526 [^ \\t\\n]+ ++word_count;
531 any "frob"'s in the input would not be counted as words, since the
532 scanner normally executes only one action per token.
535 are allowed, each one finding the next best choice to the currently
536 active rule. For example, when the following scanner scans the token
537 "abcd", it will write "abcdabcaba" to the output:
545 .|\\n /* eat up any unmatched character */
548 (The first three rules share the fourth's action since they use
549 the special '|' action.)
551 is a particularly expensive feature in terms scanner performance;
554 of the scanner's actions it will slow down
556 of the scanner's matching. Furthermore,
558 cannot be used with the
564 Note also that unlike the other special actions,
568 code immediately following it in the action will
573 tells the scanner that the next time it matches a rule, the corresponding
576 onto the current value of
578 rather than replacing it. For example, given the input "mega-kludge"
579 the following will write "mega-mega-kludge" to the output:
583 mega- ECHO; yymore();
587 First "mega-" is matched and echoed to the output. Then "kludge"
588 is matched, but the previous "mega-" is still hanging around at the
593 for the "kludge" rule will actually write "mega-kludge".
596 in the scanner's action entails a minor performance penalty in the
597 scanner's matching speed.
600 returns all but the first
602 characters of the current token back to the input stream, where they
603 will be rescanned when the scanner looks for the next match.
607 are adjusted appropriately (e.g.,
611 ). For example, on the input "foobar" the following will write out
616 foobar ECHO; yyless(3);
622 will cause the entire current input string to be scanned again. Unless you've
623 changed how the scanner will subsequently process its input (using
625 for example), this will result in an endless loop.
630 back onto the input stream. It will be the next character scanned.
631 The following action will take the current token and cause it
632 to be rescanned enclosed in parentheses.
638 for ( i = yyleng - 1; i >= 0; --i )
646 puts the given character back at the
648 of the input stream, pushing back strings must be done back-to-front.
651 reads the next character from the input stream. For example,
652 the following is one way to eat up C comments:
661 while ( (c = input()) != '*' &&
663 ; /* eat up text of comment */
667 while ( (c = input()) == '*' )
670 break; /* found the end */
675 error( "EOF in comment" );
682 (Note that if the scanner is compiled using
686 is instead referred to as
688 in order to avoid a name clash with the
690 stream by the name of
694 can be used in lieu of a return statement in an action. It terminates
695 the scanner and returns a 0 to the scanner's caller, indicating "all done".
696 Subsequent calls to the scanner will immediately return unless preceded
702 is also called when an end-of-file is encountered. It is a macro and
704 .SH THE GENERATED SCANNER
709 which contains the scanning routine
711 a number of tables used by it for matching tokens, and a number
712 of auxiliary routines and macros. By default,
714 is declared as follows:
719 ... various definitions and the actions in here ...
723 (If your environment supports function prototypes, then it will
724 be "int yylex( void )".) This definition may be changed by redefining
725 the "YY_DECL" macro. For example, you could use:
729 #define YY_DECL float lexscan( a, b ) float a, b;
732 to give the scanning routine the name
734 returning a float, and taking two floats as arguments. Note that
735 if you give arguments to the scanning routine using a
736 K&R-style/non-prototyped function declaration, you must terminate
737 the definition with a semi-colon (;).
741 is called, it scans tokens from the global input file
743 (which defaults to stdin). It continues until it either reaches
744 an end-of-file (at which point it returns the value 0) or
745 one of its actions executes a
748 In the former case, when called again the scanner will immediately
753 at the new input file. (
755 takes one argument, a
758 In the latter case (i.e., when an action
759 executes a return), the scanner may then be called again and it
760 will resume scanning where it left off.
762 By default (and for purposes of efficiency), the scanner uses
763 block-reads rather than simple
765 calls to read characters from
767 The nature of how it gets its input can be controlled by redefining the
770 YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771 action is to place up to
773 characters in the character array
775 and return in the integer variable
778 number of characters read or the constant YY_NULL (0 on Unix systems)
779 to indicate EOF. The default YY_INPUT reads from the
780 global file-pointer "yyin".
782 A sample redefinition of YY_INPUT (in the definitions
783 section of the input file):
788 #define YY_INPUT(buf,result,max_size) \\
790 int c = getchar(); \\
791 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
796 This definition will change the input processing to occur
797 one character at a time.
799 You also can add in things like keeping track of the
800 input line number this way; but don't expect your scanner to
803 When the scanner receives an end-of-file indication from YY_INPUT,
808 returns false (zero), then it is assumed that the
809 function has gone ahead and set up
811 to point to another input file, and scanning continues. If it returns
812 true (non-zero), then the scanner terminates, returning 0 to its
817 always returns 1. Presently, to redefine it you must first
818 "#undef yywrap", as it is currently implemented as a macro. As indicated
819 by the hedging in the previous sentence, it may be changed to
820 a true function in the near future.
822 The scanner writes its
826 global (default, stdout), which may be redefined by the user simply
827 by assigning it to some other
832 provides a mechanism for conditionally activating rules. Any rule
833 whose pattern is prefixed with "<sc>" will only be active when
834 the scanner is in the start condition named "sc". For example,
837 <STRING>[^"]* { /* eat up the string body ... */
842 will be active only when the scanner is in the "STRING" start
846 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
851 will be active only when the current start condition is
852 either "INITIAL", "STRING", or "QUOTE".
855 are declared in the definitions (first) section of the input
856 using unindented lines beginning with either
860 followed by a list of names.
863 start conditions, the latter
865 start conditions. A start condition is activated using the
867 action. Until the next
869 action is executed, rules with the given start
870 condition will be active and
871 rules with other start conditions will be inactive.
872 If the start condition is
874 then rules with no start conditions at all will also be active.
879 rules qualified with the start condition will be active.
880 A set of rules contingent on the same exclusive start condition
881 describe a scanner which is independent of any of the other rules in the
883 input. Because of this,
884 exclusive start conditions make it easy to specify "mini-scanners"
885 which scan portions of the input that are syntactically different
886 from the rest (e.g., comments).
888 If the distinction between inclusive and exclusive start conditions
889 is still a little vague, here's a simple example illustrating the
890 connection between the two. The set of rules:
895 <example>foo /* do something */
903 <INITIAL,example>foo /* do something */
909 any unmatched character) remains active in start conditions.
912 returns to the original state where only the rules with
913 no start conditions are active. This state can also be
914 referred to as the start-condition "INITIAL", so
918 (The parentheses around the start condition name are not required but
919 are considered good style.)
922 actions can also be given as indented code at the beginning
923 of the rules section. For example, the following will cause
924 the scanner to enter the "SPECIAL" start condition whenever
926 is called and the global variable
938 <SPECIAL>blahblahblah
939 ...more rules follow...
943 To illustrate the uses of start conditions,
944 here is a scanner which provides two different interpretations
945 of a string like "123.456". By default it will treat it as
946 as three tokens, the integer "123", a dot ('.'), and the integer "456".
947 But if the string is preceded earlier in the line by the string
949 it will treat it as a single token, the floating-point number
959 expect-floats BEGIN(expect);
961 <expect>[0-9]+"."[0-9]+ {
962 printf( "found a float, = %f\\n",
966 /* that's the end of the line, so
967 * we need another "expect-number"
968 * before we'll recognize any more
975 printf( "found an integer, = %d\\n",
979 "." printf( "found a dot\\n" );
982 Here is a scanner which recognizes (and discards) C comments while
983 maintaining a count of the current input line.
992 <comment>[^*\\n]* /* eat anything that's not a '*' */
993 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
994 <comment>\\n ++line_num;
995 <comment>"*"+"/" BEGIN(INITIAL);
998 Note that start-conditions names are really integer values and
999 can be stored as such. Thus, the above could be extended in the
1009 comment_caller = INITIAL;
1016 comment_caller = foo;
1020 <comment>[^*\\n]* /* eat anything that's not a '*' */
1021 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1022 <comment>\\n ++line_num;
1023 <comment>"*"+"/" BEGIN(comment_caller);
1026 One can then implement a "stack" of start conditions using an
1027 array of integers. (It is likely that such stacks will become
1030 feature in the future.) Note, though, that
1031 start conditions do not have their own name-space; %s's and %x's
1032 declare names in the same fashion as #define's.
1033 .SH MULTIPLE INPUT BUFFERS
1034 Some scanners (such as those which support "include" files)
1035 require reading from several input streams. As
1037 scanners do a large amount of buffering, one cannot control
1038 where the next input will be read from by simply writing a
1040 which is sensitive to the scanning context.
1042 is only called when the scanner reaches the end of its buffer, which
1043 may be a long time after scanning a statement such as an "include"
1044 which requires switching the input source.
1046 To negotiate these sorts of problems,
1048 provides a mechanism for creating and switching between multiple
1049 input buffers. An input buffer is created by using:
1052 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1057 pointer and a size and creates a buffer associated with the given
1058 file and large enough to hold
1060 characters (when in doubt, use
1062 for the size). It returns a
1064 handle, which may then be passed to other routines:
1067 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1070 switches the scanner's input buffer so subsequent tokens will
1074 .B yy_switch_to_buffer()
1075 may be used by yywrap() to sets things up for continued scanning, instead
1076 of opening a new file and pointing
1081 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1084 is used to reclaim the storage associated with a buffer.
1088 .B yy_create_buffer(),
1089 provided for compatibility with the C++ use of
1093 for creating and destroying dynamic objects.
1096 .B YY_CURRENT_BUFFER
1099 handle to the current buffer.
1101 Here is an example of using these features for writing a scanner
1102 which expands include files (the
1104 feature is discussed below):
1107 /* the "incl" state is used for picking up the name
1108 * of an include file
1113 #define MAX_INCLUDE_DEPTH 10
1114 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1115 int include_stack_ptr = 0;
1119 include BEGIN(incl);
1122 [^a-z\\n]*\\n? ECHO;
1124 <incl>[ \\t]* /* eat the whitespace */
1125 <incl>[^ \\t\\n]+ { /* got the include file name */
1126 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1128 fprintf( stderr, "Includes nested too deeply" );
1132 include_stack[include_stack_ptr++] =
1135 yyin = fopen( yytext, "r" );
1140 yy_switch_to_buffer(
1141 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1147 if ( --include_stack_ptr < 0 )
1153 yy_switch_to_buffer(
1154 include_stack[include_stack_ptr] );
1158 .SH END-OF-FILE RULES
1159 The special rule "<<EOF>>" indicates
1160 actions which are to be taken when an end-of-file is
1161 encountered and yywrap() returns non-zero (i.e., indicates
1162 no further files to process). The action must finish
1163 by doing one of four things:
1169 has been pointed at a new file to process;
1179 or, switching to a new buffer using
1180 .B yy_switch_to_buffer()
1181 as shown in the example above.
1183 <<EOF>> rules may not be used with other
1184 patterns; they may only be qualified with a list of start
1185 conditions. If an unqualified <<EOF>> rule is given, it
1188 start conditions which do not already have <<EOF>> actions. To
1189 specify an <<EOF>> rule for only the initial start condition, use
1196 These rules are useful for catching things like unclosed comments.
1203 ...other rules for dealing with quotes...
1206 error( "unterminated quote" );
1212 yyin = fopen( *filelist, "r" );
1220 .SH MISCELLANEOUS MACROS
1224 can be redefined to provide an action
1225 which is always executed prior to the matched rule's action. For example,
1226 it could be #define'd to call a routine to convert yytext to lower-case.
1230 may be redefined to provide an action which is always executed before
1231 the first scan (and before the scanner's internal initializations are done).
1232 For example, it could be used to call a routine to read
1233 in a data table or open a logging file.
1235 In the generated scanner, the actions are all gathered in one large
1236 switch statement and separated using
1238 which may be redefined. By default, it is simply a "break", to separate
1239 each rule's action from the following rule's.
1242 allows, for example, C++ users to
1243 #define YY_BREAK to do nothing (while being very careful that every
1244 rule ends with a "break" or a "return"!) to avoid suffering from
1245 unreachable statement warnings where because a rule's action ends with
1249 .SH INTERFACING WITH YACC
1250 One of the main uses of
1252 is as a companion to the
1256 parsers expect to call a routine named
1258 to find the next input token. The routine is supposed to
1259 return the type of the next token as well as putting any associated
1270 to instruct it to generate the file
1272 containing definitions of all the
1276 input. This file is then included in the
1278 scanner. For example, if one of the tokens is "TOK_NUMBER",
1279 part of the scanner might look like:
1288 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1291 .SH TRANSLATION TABLE
1292 In the name of POSIX compliance,
1295 .I translation table
1296 for mapping input characters into groups.
1297 The table is specified in the first section, and its format looks like:
1302 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1308 This example specifies that the characters 'a', 'b', 'c', and 'd'
1309 are to all be lumped into group #1, upper-case letters
1310 in group #2, digits in group #52, tabs, blanks, and newlines into
1313 no other characters will appear in the patterns.
1314 The group numbers are actually disregarded by
1317 serves, though, to lump characters together. Given the above
1318 table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1319 They both say, "match any character in group #1, followed by
1320 zero-or-more pairs of characters
1321 from group #2, followed by a character from group #52." Thus
1323 provides a crude way for introducing equivalence classes into
1324 the scanner specification.
1328 option (see below) coupled with the equivalence classes which
1330 automatically generates take care of virtually all the instances
1331 when one might consider using
1333 But what the hell, it's there if you want it.
1336 has the following options:
1339 Generate backtracking information to
1341 This is a list of scanner states which require backtracking
1342 and the input characters on which they do so. By adding rules one
1343 can remove backtracking states. If all backtracking states
1348 is used, the generated scanner will run faster (see the
1350 flag). Only users who wish to squeeze every last cycle out of their
1351 scanners need worry about this option. (See the section on PERFORMANCE
1352 CONSIDERATIONS below.)
1355 is a do-nothing, deprecated option included for POSIX compliance.
1358 in previous releases of
1361 specified table-compression options. This functionality is
1364 flag. To ease the the impact of this change, when
1368 it currently issues a warning message and assumes that
1370 was desired instead. In the future this "promotion" of
1374 will go away in the name of full POSIX compliance (unless
1375 the POSIX meaning is removed first).
1378 makes the generated scanner run in
1380 mode. Whenever a pattern is recognized and the global
1382 is non-zero (which is the default),
1383 the scanner will write to
1388 --accepting rule at line 53 ("the matched text")
1391 The line number refers to the location of the rule in the file
1392 defining the scanner (i.e., the file that was fed to flex). Messages
1393 are also generated when the scanner backtracks, accepts the
1394 default rule, reaches the end of its input buffer (or encounters
1395 a NUL; at this point, the two look the same as far as the scanner's concerned),
1396 or reaches an end-of-file.
1399 specifies (take your pick)
1403 No table compression is done. The result is large but fast.
1404 This option is equivalent to
1413 scanner. The case of letters given in the
1416 be ignored, and tokens in the input will be matched regardless of case. The
1417 matched text given in
1419 will have the preserved case (i.e., it will not be folded).
1422 is another do-nothing, deprecated option included only for
1426 generates a performance report to stderr. The report
1427 consists of comments regarding features of the
1429 input file which will cause a loss of performance in the resulting scanner.
1430 Note that the use of
1432 and variable trailing context (see the BUGS section in flex(1))
1433 entails a substantial performance penalty; use of
1440 flag entail minor performance penalties.
1445 (that unmatched scanner input is echoed to
1447 to be suppressed. If the scanner encounters input that does not
1448 match any of its rules, it aborts with an error. This option is
1449 useful for finding holes in a scanner's rule set.
1454 to write the scanner it generates to standard output instead
1463 a summary of statistics regarding the scanner it generates.
1464 Most of the statistics are meaningless to the casual
1467 first line identifies the version of
1469 which is useful for figuring
1470 out where you stand with respect to patches and new releases,
1471 and the next two lines give the date when the scanner was created
1472 and a summary of the flags which were in effect.
1478 scanner table representation should be used. This representation is
1479 about as fast as the full table representation
1482 and for some sets of patterns will be considerably smaller (and for
1483 others, larger). In general, if the pattern set contains both "keywords"
1484 and a catch-all, "identifier" rule, such as in the set:
1487 "case" return TOK_CASE;
1488 "switch" return TOK_SWITCH;
1490 "default" return TOK_DEFAULT;
1491 [a-z]+ return TOK_ID;
1494 then you're better off using the full table representation. If only
1495 the "identifier" rule is present and you then use a hash table or some such
1496 to detect the keywords, you're better off using
1500 This option is equivalent to
1509 scanner. Normally, scanners generated by
1511 always look ahead one
1512 character before deciding that a rule has been matched. At the cost of
1513 some scanning overhead,
1515 will generate a scanner which only looks ahead
1516 when needed. Such scanners are called
1518 because if you want to write a scanner for an interactive system such as a
1519 command shell, you will probably want the user's input to be terminated
1520 with a newline, and without
1522 the user will have to type a character in addition to the newline in order
1523 to have the newline recognized. This leads to dreadful interactive
1526 If all this seems to confusing, here's the general rule: if a human will
1527 be typing in input to your scanner, use
1529 otherwise don't; if you don't care about squeezing the utmost performance
1530 from your scanner and you
1531 don't want to make any assumptions about the input to your scanner,
1537 cannot be used in conjunction with
1552 directives. Without this option,
1554 peppers the generated scanner
1555 with #line directives so error messages in the actions will be correctly
1556 located with respect to the original
1558 input file, and not to
1559 the fairly meaningless line numbers of
1563 does not presently generate the necessary directives
1564 to "retarget" the line numbers for those parts of
1566 which it generated. So if there is an error in the generated code,
1567 a meaningless line number is reported.)
1574 mode. It will generate a lot of messages to
1577 the form of the input and the resultant non-deterministic and deterministic
1578 finite automata. This option is mostly for use in maintaining
1584 to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1585 characters. On some sites,
1587 is installed with this option as the default. On others, the default
1588 is 7-bit characters. To see which is the case, check the verbose
1590 output for "equivalence classes created". If the denominator of
1591 the number shown is 128, then by default
1593 is generating 7-bit characters. If it is 256, then the default is
1594 8-bit characters and the
1596 flag is not required (but may be a good idea to keep the scanner
1597 specification portable). Feeding a 7-bit scanner 8-bit characters
1598 will result in infinite loops, bus errors, or other such fireworks,
1599 so when in doubt, use the flag. Note that if equivalence classes
1600 are used, 8-bit scanners take only slightly more table space than
1601 7-bit scanners (128 bytes, to be exact); if equivalence classes are
1602 not used, however, then the tables may grow up to twice their
1606 controls the degree of table compression.
1612 .I equivalence classes,
1613 i.e., sets of characters
1614 which have identical lexical properties (for example, if the only
1615 appearance of digits in the
1617 input is in the character class
1618 "[0-9]" then the digits '0', '1', ..., '9' will all be put
1619 in the same equivalence class). Equivalence classes usually give
1620 dramatic reductions in the final table/object file sizes (typically
1621 a factor of 2-5) and are pretty cheap performance-wise (one array
1622 look-up per character scanned).
1627 scanner tables should be generated -
1629 should not compress the
1630 tables by taking advantages of similar transition functions for
1634 specifies that the alternate fast scanner representation (described
1644 .I meta-equivalence classes,
1645 which are sets of equivalence classes (or characters, if equivalence
1646 classes are not being used) that are commonly used together. Meta-equivalence
1647 classes are often a big win when using compressed tables, but they
1648 have a moderate performance impact (one or two "if" tests and one
1649 array look-up per character scanned).
1653 specifies that the scanner tables should be compressed but neither
1654 equivalence classes nor meta-equivalence classes should be used.
1662 do not make sense together - there is no opportunity for meta-equivalence
1663 classes if the table is not being compressed. Otherwise the options
1664 may be freely mixed.
1666 The default setting is
1668 which specifies that
1670 should generate equivalence classes
1671 and meta-equivalence classes. This setting provides the highest
1672 degree of table compression. You can trade off
1673 faster-executing scanners at the cost of larger tables with
1674 the following generally being true:
1687 Note that scanners with the smallest tables are usually generated and
1688 compiled the quickest, so
1689 during development you will usually want to use the default, maximal
1693 is often a good compromise between speed and size for production
1697 options are not cumulative; whenever the flag is encountered, the
1698 previous -C settings are forgotten.
1701 overrides the default skeleton file from which
1703 constructs its scanners. You'll never need this option unless you are doing
1705 maintenance or development.
1706 .SH PERFORMANCE CONSIDERATIONS
1707 The main design goal of
1709 is that it generate high-performance scanners. It has been optimized
1710 for dealing well with large sets of rules. Aside from the effects
1711 of table compression on scanner speed outlined above,
1712 there are a number of options/actions which degrade performance. These
1713 are, from most expensive to least:
1718 pattern sets that require backtracking
1719 arbitrary trailing context
1721 '^' beginning-of-line operator
1725 with the first three all being quite expensive and the last two
1729 should be avoided at all costs when performance is important.
1730 It is a particularly expensive option.
1732 Getting rid of backtracking is messy and often may be an enormous
1733 amount of work for a complicated scanner. In principal, one begins
1738 file. For example, on the input
1742 foo return TOK_KEYWORD;
1743 foobar return TOK_KEYWORD;
1746 the file looks like:
1749 State #6 is non-accepting -
1750 associated rule line numbers:
1752 out-transitions: [ o ]
1753 jam-transitions: EOF [ \\001-n p-\\177 ]
1755 State #8 is non-accepting -
1756 associated rule line numbers:
1758 out-transitions: [ a ]
1759 jam-transitions: EOF [ \\001-` b-\\177 ]
1761 State #9 is non-accepting -
1762 associated rule line numbers:
1764 out-transitions: [ r ]
1765 jam-transitions: EOF [ \\001-q s-\\177 ]
1767 Compressed tables always backtrack.
1770 The first few lines tell us that there's a scanner state in
1771 which it can make a transition on an 'o' but not on any other
1772 character, and that in that state the currently scanned text does not match
1773 any rule. The state occurs when trying to match the rules found
1774 at lines 2 and 3 in the input file.
1775 If the scanner is in that state and then reads
1776 something other than an 'o', it will have to backtrack to find
1777 a rule which is matched. With
1778 a bit of headscratching one can see that this must be the
1779 state it's in when it has seen "fo". When this has happened,
1780 if anything other than another 'o' is seen, the scanner will
1781 have to back up to simply match the 'f' (by the default rule).
1783 The comment regarding State #8 indicates there's a problem
1784 when "foob" has been scanned. Indeed, on any character other
1785 than a 'b', the scanner will have to back up to accept "foo".
1786 Similarly, the comment for State #9 concerns when "fooba" has
1789 The final comment reminds us that there's no point going to
1790 all the trouble of removing backtracking from the rules unless
1795 since there's no performance gain doing so with compressed scanners.
1797 The way to remove the backtracking is to add "error" rules:
1801 foo return TOK_KEYWORD;
1802 foobar return TOK_KEYWORD;
1807 /* false alarm, not really a keyword */
1813 Eliminating backtracking among a list of keywords can also be
1814 done using a "catch-all" rule:
1818 foo return TOK_KEYWORD;
1819 foobar return TOK_KEYWORD;
1821 [a-z]+ return TOK_ID;
1824 This is usually the best solution when appropriate.
1826 Backtracking messages tend to cascade.
1827 With a complicated set of rules it's not uncommon to get hundreds
1828 of messages. If one can decipher them, though, it often
1829 only takes a dozen or so rules to eliminate the backtracking (though
1830 it's easy to make a mistake and have an error rule accidentally match
1831 a valid token. A possible future
1833 feature will be to automatically add rules to eliminate backtracking).
1836 trailing context (where both the leading and trailing parts do not have
1837 a fixed length) entails almost the same performance loss as
1839 (i.e., substantial). So when possible a rule like:
1843 mouse|rat/(cat|dog) run();
1850 mouse/cat|dog run();
1858 mouse|rat/cat run();
1859 mouse|rat/dog run();
1862 Note that here the special '|' action does
1864 provide any savings, and can even make things worse (see
1868 Another area where the user can increase a scanner's performance
1869 (and one that's easier to implement) arises from the fact that
1870 the longer the tokens matched, the faster the scanner will run.
1871 This is because with long tokens the processing of most input
1872 characters takes place in the (short) inner scanning loop, and
1873 does not often have to go through the additional work of setting up
1874 the scanning environment (e.g.,
1876 for the action. Recall the scanner for C comments:
1883 "/*" BEGIN(comment);
1886 <comment>"*"+[^*/\\n]*
1887 <comment>\\n ++line_num;
1888 <comment>"*"+"/" BEGIN(INITIAL);
1891 This could be sped up by writing it as:
1898 "/*" BEGIN(comment);
1901 <comment>[^*\\n]*\\n ++line_num;
1902 <comment>"*"+[^*/\\n]*
1903 <comment>"*"+[^*/\\n]*\\n ++line_num;
1904 <comment>"*"+"/" BEGIN(INITIAL);
1907 Now instead of each newline requiring the processing of another
1908 action, recognizing the newlines is "distributed" over the other rules
1909 to keep the matched text as long as possible. Note that
1913 slow down the scanner! The speed of the scanner is independent
1914 of the number of rules or (modulo the considerations given at the
1915 beginning of this section) how complicated the rules are with
1916 regard to operators such as '*' and '|'.
1918 A final example in speeding up a scanner: suppose you want to scan
1919 through a file containing identifiers and keywords, one per line
1920 and with no other extraneous characters, and recognize all the
1921 keywords. A natural first approach is:
1930 while /* it's a keyword */
1932 .|\\n /* it's not a keyword */
1935 To eliminate the back-tracking, introduce a catch-all rule:
1944 while /* it's a keyword */
1947 .|\\n /* it's not a keyword */
1950 Now, if it's guaranteed that there's exactly one word per line,
1951 then we can reduce the total number of matches by a half by
1952 merging in the recognition of newlines with that of the other
1962 while\\n /* it's a keyword */
1965 .|\\n /* it's not a keyword */
1968 One has to be careful here, as we have now reintroduced backtracking
1969 into the scanner. In particular, while
1971 know that there will never be any characters in the input stream
1972 other than letters or newlines,
1974 can't figure this out, and it will plan for possibly needing backtracking
1975 when it has scanned a token like "auto" and then the next character
1976 is something other than a newline or a letter. Previously it would
1977 then just match the "auto" rule and be done, but now it has no "auto"
1978 rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
1979 we could either duplicate all rules but without final newlines, or,
1980 since we never expect to encounter such an input and therefore don't
1981 how it's classified, we can introduce one more catch-all rule, this
1982 one which doesn't include a newline:
1991 while\\n /* it's a keyword */
1995 .|\\n /* it's not a keyword */
2000 this is about as fast as one can get a
2002 scanner to go for this particular problem.
2006 is slow when matching NUL's, particularly when a token contains
2008 It's best to write rules which match
2010 amounts of text if it's anticipated that the text will often include NUL's.
2011 .SH INCOMPATIBILITIES WITH LEX AND POSIX
2013 is a rewrite of the Unix
2015 tool (the two implementations do not share any code, though),
2016 with some extensions and incompatibilities, both of which
2017 are of concern to those who wish to write scanners acceptable
2018 to either implementation. At present, the POSIX
2021 very close to the original
2023 implementation, so some of these
2024 incompatibilities are also in conflict with the POSIX draft. But
2025 the intent is that except as noted below,
2027 as it presently stands will
2028 ultimately be POSIX conformant (i.e., that those areas of conflict with
2029 the POSIX draft will be resolved in
2031 favor). Please bear in
2032 mind that all the comments which follow are with regard to the POSIX
2034 standard of Summer 1989, and not the final document (or subsequent
2035 drafts); they are included so
2037 users can be aware of the standardization issues and those areas where
2039 may in the near future undergo changes incompatible with
2040 its current definition.
2043 is fully compatible with
2045 with the following exceptions:
2049 scanner internal variable
2051 is not supported. It is difficult to support this option efficiently,
2052 since it requires examining every character scanned and reexamining
2053 the characters when the scanner backs up.
2054 Things get more complicated when the end of buffer or file is reached or a
2055 NUL is scanned (since the scan must then be restarted with the proper line
2056 number count), or the user uses the yyless(), unput(), or REJECT actions,
2057 or the multiple input buffer functions.
2059 The fix is to add rules which, upon seeing a newline, increment
2060 yylineno. This is usually an easy process, though it can be a drag if some
2061 of the patterns can match multiple newlines along with other characters.
2063 yylineno is not part of the POSIX draft.
2067 routine is not redefinable, though it may be called to read characters
2068 following whatever has been matched by a rule. If
2070 encounters an end-of-file the normal
2072 processing is done. A ``real'' end-of-file is returned by
2077 Input is instead controlled by redefining the
2085 cannot be redefined is in accordance with the POSIX draft, but
2087 has not yet been accepted into the draft (and probably won't; it looks
2088 like the draft will simply not specify any way of controlling the
2089 scanner's input other than by making an initial assignment to
2093 scanners do not use stdio for input. Because of this, when writing an
2094 interactive scanner one must explicitly call fflush() on the
2095 stream associated with the terminal after writing out a prompt.
2098 such writes are automatically flushed since
2102 for their input. Also, when writing interactive scanners with
2109 scanners are not as reentrant as
2111 scanners. In particular, if you have an interactive scanner and
2112 an interrupt handler which long-jumps out of the scanner, and
2113 the scanner is subsequently called again, you may get the following
2117 fatal flex scanner internal error--end of buffer missed
2120 To reenter the scanner, first use
2131 macro is done to the file-pointer
2136 The POSIX draft mentions that an
2138 routine exists but currently gives no details as to what it does.
2141 does not support exclusive start conditions (%x), though they
2142 are in the current POSIX draft.
2144 When definitions are expanded,
2146 encloses them in parentheses.
2147 With lex, the following:
2152 foo{NAME}? printf( "Found it\\n" );
2156 will not match the string "foo" because when the macro
2157 is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2158 and the precedence is such that the '?' is associated with
2161 the rule will be expanded to
2162 "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2163 Note that because of this, the
2167 operators cannot be used in a
2171 The POSIX draft interpretation is the same as
2174 To specify a character class which matches anything but a left bracket (']'),
2177 one can use "[^]]" but with
2179 one must use "[^\\]]". The latter works with
2186 (generate a Ratfor scanner) option is not supported. It is not part
2189 If you are providing your own yywrap() routine, you must include a
2190 "#undef yywrap" in the definitions section (section 1). Note that
2191 the "#undef" will have to be enclosed in %{}'s.
2194 specifies that yywrap() is a function and this is very unlikely to change; so
2195 .I flex users are warned
2198 is likely to be changed to a function in the near future.
2205 are undefined until the next token is matched. This is not the case with
2207 or the present POSIX draft.
2209 The precedence of the
2211 (numeric range) operator is different.
2213 interprets "abc{1,3}" as "match one, two, or
2214 three occurrences of 'abc'", whereas
2216 interprets it as "match 'ab'
2217 followed by one, two, or three occurrences of 'c'". The latter is
2218 in agreement with the current POSIX draft.
2220 The precedence of the
2222 operator is different.
2224 interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2225 or 'bar' anywhere", whereas
2227 interprets it as "match either 'foo' or 'bar' if they come at the beginning
2228 of a line". The latter is in agreement with the current POSIX draft.
2230 To refer to yytext outside of the scanner source file,
2231 the correct definition with
2233 is "extern char *yytext" rather than "extern char yytext[]".
2234 This is contrary to the current POSIX draft but a point on which
2236 will not be changing, as the array representation entails a
2237 serious performance penalty. It is hoped that the POSIX draft will
2238 be emended to support the
2240 variety of declaration (as this is a fairly painless change to
2261 the first time the scanner is called, providing
2263 has not already been assigned to a non-NULL value. The difference is
2264 subtle, but the net effect is that with
2268 does not have a valid value until the scanner has been called.
2270 The special table-size declarations such as
2283 is #define'd so scanners may be written for use with either
2290 features are not included in
2292 or the POSIX draft standard:
2299 %{}'s around actions
2301 comments beginning with '#' (deprecated)
2302 multiple actions on a line
2305 This last feature refers to the fact that with
2307 you can put multiple actions on the same line, separated with
2308 semi-colons, while with
2313 foo handle_foo(); ++num_foos_seen;
2316 is (rather surprisingly) truncated to
2323 does not truncate the action. Actions that are not enclosed in
2324 braces are simply terminated at the end of the line.
2326 .I reject_used_but_not_detected undefined
2328 .I yymore_used_but_not_detected undefined -
2329 These errors can occur at compile time. They indicate that the
2336 failed to notice the fact, meaning that
2338 scanned the first two sections looking for occurrences of these actions
2339 and failed to find any, but somehow you snuck some in (via a #include
2340 file, for example). Make an explicit reference to the action in your
2342 input file. (Note that previously
2346 mechanism for dealing with this problem; this feature is still supported
2347 but now deprecated, and will go away soon unless the author hears from
2348 people who can argue compellingly that they need it.)
2350 .I flex scanner jammed -
2351 a scanner compiled with
2353 has encountered an input string which wasn't matched by
2356 .I flex input buffer overflowed -
2357 a scanner rule matched a string long enough to overflow the
2358 scanner's internal input buffer (16K bytes by default - controlled by
2360 in "flex.skel". Note that to redefine this macro, you must first
2364 .I scanner requires -8 flag -
2365 Your scanner specification includes recognizing 8-bit characters and
2366 you did not specify the -8 flag (and your site has not installed flex
2367 with -8 as the default).
2370 fatal flex scanner internal error--end of buffer missed -
2371 This can occur in an scanner which is reentered after a long-jump
2372 has jumped out (or over) the scanner's activation frame. Before
2373 reentering the scanner, use:
2380 .I too many %t classes! -
2381 You managed to put every single character into its own %t class.
2383 requires that at least one of the classes share characters.
2384 .SH DEFICIENCIES / BUGS
2388 flex(1), lex(1), yacc(1), sed(1), awk(1).
2390 M. E. Lesk and E. Schmidt,
2391 .I LEX - Lexical Analyzer Generator
2393 Vern Paxson, with the help of many ideas and much inspiration from
2394 Van Jacobson. Original version by Jef Poskanzer. The fast table
2395 representation is a partial implementation of a design done by Van
2396 Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2400 beta-testers, feedbackers, and contributors, especially Casey
2401 Leedom, benson@odi.com, Keith Bostic,
2402 Frederic Brehm, Nick Christopher, Jason Coughlin,
2403 Scott David Daniels, Leo Eskin,
2404 Chris Faylor, Eric Goldman, Eric
2405 Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2406 Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2407 Jef Poskanzer, Jim Roskind,
2408 Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2409 have slipped my marginal mail-archiving skills but whose contributions
2410 are appreciated all the same.
2412 Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2413 Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2416 Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2417 to Benson Margulies and Fred
2418 Burke for C++ support; to Ove Ewerlid for the basics of support for
2419 NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2421 Work is being done on extending
2423 to generate scanners in which the
2424 state machine is directly represented in C code rather than tables.
2425 These scanners may well be substantially faster than those generated
2426 using -f or -F. If you are working in this area and are interested
2427 in comparing notes and seeing whether redundant work can be avoided,
2428 contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2430 This work was primarily done when I was at the Real Time Systems Group
2431 at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2432 for the support I received.
2438 Computer Science Department
2441 Ithaca, NY 14853-7501