Regular Expressions

© Copyright Herbert J. Bernstein 2009 All Rights Reserved

A regular expression is a pattern used to match strings of characters. Many applications, such as compilers and editors, make use of regular expressions to parse strings of text. There are many excellent reference web pages on regular expressions. See, for example, the wikipedia article http://en.wikipedia.org/wiki/Regular_expression and the regular-expressions.info tutorial http://www.regular-expressions.info/tutorial.html

The purpose of this web page is to provide a brief introduction to help in the use of regular expressions in the context of compiler construction and use of editors.

Note: There is significant variation in the details of implementation of regular expressions and the manuals for the systems be used should be carefully consulted.

A regular expression is a string used as a pattern to be matched against other strings. The simplest regular expression is literally the string to be matched against all possible substrings of the strings to be searched, e.g.

mugwump

to successfully match any of the strings "mugwump", "gwumpmugwumpmug", "Have you seen my mugwump?", "mugwumpmugwumpmugwumpmugwumpmugwumpmugwump", but not "mugwum".

Some characters are not handled literally, however. They are metacharacters or operators. The choice of which ones to allow in a particular regular expression parser varies, but the following are commonly used (e.g. in the program lex):

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

as well as the space character (the blank).

The asterisk * indicates zero or more repetititions, the plus sign + indicates one or more repetitions and the question mark ? indicates 0 or 1 repetitions. The curly brackets are used to list an explicit range of repetitions. The period . matches any characters except the end of line. Thus

.*

matches all the remaining characters on the current line, even if there are none

..

matches the next two characters,

colou?r

matches either 'coulor' or 'colour',

0+

matches 1 or more zeros, and

0{1,6}

matches '0', '00', '000', '0000', '00000' and '000000'.

Outside of square brakets, the caret ^ matches the start of a line and the $ matches the end of a line. In order to use metacharacters as literal text, they must be quoted. Matched pairs of double quotes, "...", around a string of characters will cause them to be taken literally. In addition, in many regular expression systems, the backslash "\" may also be used to quote the next character, except for the commonly used C-escape sequences:

Sequence Meaning
\n The newline or LF character, control-J, (0x0A)
\r The character return or CR character, control-M, (0x0D)
\f The form feed or NP (FF) character, control-L, (0x0C)
\v The vertical tab or VT character, control-K, (0x0B)
\t The horizontal tab or HT character, control-I, (0x09)
\b The backspace or BS character, control-H, (0x08)
\nnn The byte with the bit pattern represented by nnnn in octal. A zero value will not be handled correctly.
\xnnn The byte with the bit pattern represented by nnnn in hexadecimal. A zero value will not be handled correctly.
\a The alert or BEL character, control-G (0x07)
\\ The reverse solidus or backslash itself

The square brackets [ ] are used to identify character classes. Inside a character class most operators lose their special meaning and need not be quoted. Only the -, the ^ and the \ have special meaning, at least in the original lex regular expressions. In later versions, the special digraphs [: and :] also have special meaning within the square brackets, identifying special characer class expressions.

The characters listed between square brackets are alternative pattern matches. Thus [abc] is a match to any of the three characters, 'a', 'b', or ,'c'

If - is the first or last character between square brackets, it has its literal meaning. Otherwise, the - between two characters indicate a range. Thus [0-9] is a match to any of the decimal digits, [0-9A-Fa-f] is a match to any of the hexadecimal digits, [A-Z] is a match to any of the upper case letters and [A-Za-z] is a match to any letter.

If ^ appears between square brackets, but not as the first character it has its literal meaning. If ^ appears as the first character then any character that does not match the rest of the pattern between the square brackets is a match to the character class. Thus [^\n] matches any character other than the newline and [^\001-\037\177] matches any printable ASCII character by excluding all to control characters and DEL. You must use great care in negating expressions if you do not want the newline or other control characters to be a match to the negated expression. In later version of lex, you can include [:cntrl:], as in [^A-Za-z[:cntrl:]] to restrict the negated match to non-control characters.

There are many special character classe expressions that may be used inside square brackets in later versions of lex (e.g. flex):

The flex manual very sensibly points out that the interpretation of locale sensitive character class expressions is done in the environment in which flex itself is run, not the environment in which the lexical scanner it generates is run.

You will sometimes see negation inside a character class expression as in [[:^alpha:]].

In later versions of lex, {-} and {+} are permitted with a character class to indicate the difference and union of two character classes. For example [A-Z{-}I-N] matches the uppercase letters of the alphabet other than 'I", 'J', 'K', 'L', 'M', 'N', which happen to be the intial characters of the default REAL type in Fortran.

The / character separates a pattern to match from a pattern to be matched as a trailing context. Thus if we only want to accept trailing double quote as completing a string if it is followed by trailing whitespace, we might use the pattern \".[^"\r\n]\"/[[:space:]]

Finally, the parentheses () may be used for grouping and the vertical bar may be used to mark alternatives.