REGULAR EXPRESSIONS

     A regular expression is, on the one hand, a string like any
     other: a sequence of characters. On the other hand, special
     characters within the string have certain functions which make
     regular expressions useful when trying to match portions of
     other strings. In the following discussion and examples, a
     string containing a regular expression is called the
     ``pattern'', and the string against which it is to be matched
     is called the ``reference string''.

     For example, regular expressions allow one to search for:

          "all strings ending with the letters 'ize'"

     or:

          "all strings beginning with a number between 1 and 3 and
          ending in a comma".

     In order to accomplish this, regular expressions co-opt some
     characters to have special meaning. They also provide for
     these characters to lose their special meaning if the user so
     desires. The rules for regular expressions are:

RULES

     c    Any character c matches itself unless it has been assigned
          a special meaning as listed below. Most special
          characters can be escaped (made to lose their special
          meaning) by placing the character '\' in front of them.
          This does not apply to '{', which is non-special until it
          is escaped. Thus, although '*' normally has special
          meaning, the pattern '\*' matches a literal '*'.

          Example: The pattern

               acdef

          matches

               s83acdeffff

          or

               acdefsecs

          but not

               accdef

          or

               aacde1f

          That is, it will match any reference string that contains
          ``acdef'' anywhere within it.

          Example: Normally the characters '*' and '$' are special,
          but the pattern

               a\*bse\$

          acts as above. That is, any reference string containing
          ``a*bse$'' as a substring will be flagged as a match.

     .    A period matches any character except the newline
          character. This is known as the wildcard character.

          Example: The pattern

               ....

          will match any 4 characters in the reference string,
          except a newline character.

     ^    If `^' appears at the beginning of the pattern then it is
          said to ``anchor'' the match to the beginning of the line.
          That is, the reference string must start with the pattern
          following the `^'. If this character appears anywhere
          other than at the beginning of the pattern, it is no
          longer considered special, and matches itself as any
          non-special character would. Similarly, if it starts the
          pattern but is escaped, it matches itself.

          Example: The pattern

               ^efghi

          will match

               efghi

          or

               efghijlk

          but not

               abcefghi

          That is, the pattern will match only those reference
          strings starting with ``efghi''. Just containing the
          substring is not sufficient.

     $    Occurring at the end of the pattern, this character
          ``anchors'' the pattern to the end of the line (reference
          string). A '$' occurring anywhere else in the pattern is
          regarded as non-special. Similarly, if it is at the end
          of the pattern but is escaped, it is non-special.

          Example: The pattern

               efghi$

          will match

               efghi

          or

               abcdefghi

          but not

               efghijkl

          That is, the pattern will match only those reference
          strings ending with ``efghi''. Just containing the
          substring is not sufficient.

     \<   This sequence in the pattern constrains the one-character
          regular expression following it to match only at the
          beginning of a ``word'': that is, either at the beginning
          of a line, or just before a letter, digit or underline
          character and just after a character which is not one of
          these.

     \>   Constrains the one-character regular expression following
          it to be at the end of a ``word'' as defined above.

     [string]
          One or more characters within square brackets. This
          pattern matches any single character within the brackets.
          The caret, '^', has a special meaning if it is the first
          character in the series: the pattern will match any
          character other than those in the list.

          Example: The pattern

               [^abc]

          will match any character except 'a', 'b' or 'c'.

          To match a right bracket, ']', in the list it must be put
          first:

               []ab01]

          A caret, '^', can appear anywhere in the list but first.
          In

               [ab^01]

          the caret loses its special meaning.

          The '-' character is special within square brackets. It
          denotes a range of characters (in the ASCII character
          set), and the range matches any single character within
          it. For example, '[a-z]' matches any lower case letter.
          The '-' can be made non-special by placing it first or
          last within the square brackets. The characters '$', '*'
          and '.' are not special within square brackets.

          Example: The pattern

               [ab01]

          matches a single occurrence of a character from the set
          'a', 'b', '0', '1'.

          Example: The pattern

               [^ab01]

          will match any single character other than 'a', 'b', '0',
          '1'.

          Example: The pattern

               [a0-9b]

          matches one of 'a', 'b' or a digit between 0 and 9
          inclusive.

          Example: The pattern

               [^a0-9b.$]

          means any single character that is not 'a', 'b', '.', '$'
          or a digit between 0 and 9 inclusive.

     *    An asterisk following a regular expression in the pattern
          has the effect of matching zero or more occurrences of
          that expression.

          Example: The pattern

               a*

          means zero or more occurrences of the character 'a'.

          Example: The pattern

               [A-Z]*

          means zero or more occurrences of an upper case letter.

     \{m\} \{m,\} \{m,n\}
          A one-character regular expression followed by one of
          these three constructions causes a range of occurrences
          of that regular expression to be matched. If it is
          followed by \{m\}, where m is a non-negative integer
          between 0 and 255 (inclusive), then exactly m occurrences
          of that regular expression are matched. If followed by
          \{m,\}, then at least m occurrences are matched. Finally,
          if it is followed by \{m,n\} (where n is a non-negative
          integer between 0 and 255 and where n > m), then between
          m and n occurrences of the expression are matched.

          Example: The pattern

               ab\{3\}

          would match any substring in the reference string
          consisting of an 'a' followed by exactly 3 'b's.

          Example: The pattern

               ab\{3,\}

          would match any substring in the reference string
          consisting of an 'a' followed by at least 3 'b's.

          Example: The pattern

               ab\{3,5\}

          would match any substring in the reference string
          consisting of an 'a' followed by at least 3 but at most
          5 'b's.

Common Problems with Regular Expressions

     (1)  When matching a substring it is not necessary to use the
          wildcard character to match the parts of the reference
          string preceding and following the substring.

          Example: The pattern

               abcd

          will match any reference string containing this pattern.
          It is not necessary to use

               .*abcd.*

          as the pattern.

     (2)  In order to constrain a pattern to the entire reference
          string, use the construction:

               ^pattern$

     (3)  The easiest way to obtain case insensitivity in a regular
          expression is to use the '[]' operator. For example, a
          pattern to match the word ``hello'' regardless of the
          case of the letters would be:

               [Hh][Ee][Ll][Ll][Oo]
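
     The three points above can be tried out with a short script.
     This is only a sketch: it assumes Python's re module, which is
     not the regular expression implementation this page describes
     and whose syntax differs in places (for instance, it writes
     intervals as {m,n} without backslashes). The three patterns
     used here are written the same way in both. The reference
     strings are made up for illustration.

```python
import re

# Hypothetical reference strings, chosen only for illustration.
references = ["abcd", "xxabcdyy", "abXcd"]

# (1) A bare pattern already matches anywhere in the reference string,
#     so wrapping it in .* does not change which strings match.
bare = [bool(re.search(r"abcd", s)) for s in references]
padded = [bool(re.search(r".*abcd.*", s)) for s in references]

# (2) ^pattern$ constrains the match to the entire reference string.
whole = [bool(re.search(r"^abcd$", s)) for s in references]

# (3) One bracket expression per letter gives case insensitivity.
hello = re.compile(r"[Hh][Ee][Ll][Ll][Oo]")
greetings = ["hello", "HELLO", "say HeLLo", "help"]
any_case = [bool(hello.search(s)) for s in greetings]

print(bare)      # same result as padded
print(padded)
print(whole)
print(any_case)
```

     Only the first reference string passes the anchored pattern in
     (2), since the other two either carry extra characters around
     ``abcd'' or do not contain it at all.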