View on GitHub

NEH Institute materials

July 2017

Home | Admin | Week 1 | Week 2 | Week 3 | Misc

Regular Expressions

Regular expressions (called REs, regexes, regexps, regex patterns) are essentially a tiny, highly specialized programming language embedded inside general purpose programming languages (Python, XQuery, javascript). Please note that there are differences in how all general purpose programming languages implement their RE language. When using a RE language, you specify the rules for the set of possible strings that you want to match:

It does not replace a parser for XML or HTML since you can easily create invalid and non-wellformed markup.

You can then ask questions such as:

You can also use REs to modify a string or to split it apart in various ways.

RE patterns are usually compiled into a series of bytecodes which are then executed by a matching engine. (For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster or just not use too much resources to be useful).

Since a RE language is relatively small and restricted not all possible string processing tasks can be done using REs. There are also tasks that can be done with REs, but the expressions turn out to be very complicated.

In these cases, you may be better off writing code in the programming language, e.g. Python, to do the processing; Usually it is slower than an elaborate RE but probably a lot easier to understand.

Your best use would probably be to assert negative patterns, e.g. things you know are wrong.

Simple patterns

We will start by learning about the simplest possible REs. Since REs are used to operate on strings, we will begin with the most common task: matching characters.

Characters

Metacharacters

The complete list of metacharacters:

Metacharacter Name
. period or dot
^ caret
$ dollar sign
* asterisk or star
+ plus sign
? question mark
{ opening curly brace
} closing curly brace
[ opening square bracket
] closing square bracket
\ backslash
| pipe or bar
( opening parenthesis
) closing parenthesis

Character classes are surrounded by opening square bracket [ and closing square bracket ] to form a set of characters. Either you specify the characters individually or use ranges by giving a hyphen - inbetween. Metacharacters are not active inside character classes. Since the character class is a set you can also complement it. To do complementing you give a caret ^ as the first character of the class.

One of the most important metacharacters is the backslash \ which is used to:

Some of the special sequences beginning with backslash \ represent predefined shorthand sets of characters that are often useful:

\w matches any alphanumeric character. For use with Python this set differs depending on whether the RE pattern is:

Special sequence Matches Restricted1 equivalent to
\d any decimal digit [0-9]
\D any non-digit character [^0-9]
\s any whitespace character [ \t\n\r\f\v]
\S any non-whitespace character [^ \t\n\r\f\v]
\w any alphanumeric character [a-zA-Z0-9_]
\W any non-alphanumeric character [^a-zA-Z0-9_]

1 With Python you can use the more restricted definition of e.g. \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression. Otherwise the Unicode character categories are used and thus the sequence sets include a lot more characters.

Sequences can be included inside a character class. E.g. [\s:;] will match any whitespace character, a colon : or semicolon ;.

The final metacharacter in this section is dot .. It matches anything except a newline character. (In Python you can use re.DOTALL to match even a newline. Dot . is often used where you want to match any character.)

Repetition

Matching varying sets of characters is the first thing REs can do. Another capability is that you can specify that portions of the RE must be repeated, i.e. qualified, a certain number of times.

All four repeating qualifiers:

* + ? {m,n}

The first single metacharacter for repeating things that we will look at is star *. Star * does not match the literal character *. It specifies that the previous character can be matched zero or more times, instead of exactly once. This means whatever is being repeated may not be present at all.

String RE Match
star [e]* Yes
staar t[a]*r Yes

The second repeating metacharacter is plus + which matches one or more times. This requires at least one occurrence compared to asterisk *.

String RE Match
plus pl[au]+s Yes
pluus plu+s Yes
plusplus uss+ No
plussusch us[cs]+ Yes

The third single repeating qualifiers is the question mark ? which matches either once or zero times.

String RE Match
question qu?e Yes
question est?s Yes
markka rk?a No
mark a?r Yes

The fourth and most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n repetitions.

String RE Match
complicated li{1,1}c Yes
appreciated p{2,2} Yes
rain [ai]{2,2} Yes
rain [ai]{1,2} Yes
complicated li[act]{3,}ed Yes

If either m or n is omitted it becomes for e.g. {3,} three or more and {,3} up to three repetitions.

With this qualifier you can express all the single repeating qualifiers, e.g. ? as {0,1} + as {1,}, and * as {0,} but the single versions are both easier on the eye and shorter to write.

Anchors

Anchors match a position not characters.

Metacharacter anchors

Metacharacter Matches at
^ the start of a string
$ the end of a string

Most RE engines have a multi-line mode that makes caret ^ match after any line break, and dollar_sign $ before any line break.

String RE Match
complicated ^comp Yes
appreciated ed$ Yes
rain ^rain$ Yes
rain $r[ai]+n$ Yes
complicated ^comp.*ed$ Yes

Special sequence anchors

Special sequence Matches at
\b a word boundary
\B not a word boundary

A word boundary is a position between a character that can be matched by the set of characters of \w and a character that cannot be matched by \w. \b also matches at the ends of the string if the first/last characters in the string are word characters. \B matches at every position where \b cannot match.

String RE Match
complicated \bcomp Yes
appreciated \Bed\b Yes
rain \brain\b Yes
rain $r[ai]+n\b Yes
complicated \bcomp.+\b Yes