NEH Institute materials

July 2017

Regular Expressions

Regular expressions (also called REs, regexes, regexps, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside general-purpose programming languages (Python, XQuery, JavaScript). Note that each of these languages implements its RE language slightly differently. When using an RE language, you specify the rules for the set of possible strings that you want to match:

sentences in a language,
e-mail addresses,
TeX commands, or
anything you like.

An RE does not replace a parser for XML or HTML, since with REs you can easily create invalid or non-well-formed markup.

You can then ask questions such as:

Does this string match the pattern?
Is there a match for the pattern anywhere in this string?

You can also use REs to modify a string or to split it apart in various ways.
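The four operations mentioned above (match, search, modify, split) can be sketched with Python's `re` module. The sample text and the e-mail pattern are made up for illustration; the pattern is deliberately simplistic and is not a general e-mail matcher.

```python
import re

text = "Send mail to ada@example.org or bob@example.org today."
# Hypothetical, simplified pattern: lowercase name, "@", lowercase host, ".org".
pattern = r"[a-z]+@[a-z]+\.org"

# Does the whole string match the pattern?
whole = re.fullmatch(pattern, text) is not None
print(whole)  # False: the string contains more than one address

# Is there a match for the pattern anywhere in the string?
first = re.search(pattern, text)
print(first.group())  # ada@example.org

# Modify the string: replace every match.
redacted = re.sub(pattern, "<hidden>", text)
print(redacted)  # Send mail to <hidden> or <hidden> today.

# Split the string apart at runs of whitespace.
words = re.split(r"\s+", text)
print(len(words))  # 7
```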

RE patterns are usually compiled into a series of bytecodes, which are then executed by a matching engine. (For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and to write the RE in a certain way so that the resulting bytecode runs faster or does not consume too many resources to be useful.)
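In Python, the compilation step can be made explicit with `re.compile`, which returns a reusable pattern object. The following is a minimal sketch with invented sample lines; Python also caches recently used patterns internally, so the gain is often readability rather than speed.

```python
import re

# Compile the pattern once and reuse the resulting pattern object.
word_re = re.compile(r"[a-z]+")

lines = ["alpha beta", "gamma", "123"]
counts = [len(word_re.findall(line)) for line in lines]
print(counts)  # [2, 1, 0]
```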

Since an RE language is relatively small and restricted, not all possible string processing tasks can be done using REs. There are also tasks that can be done with REs, but the expressions turn out to be very complicated.

In these cases, you may be better off writing code in the programming language, e.g. Python, to do the processing. It is usually slower than an elaborate RE, but probably a lot easier to understand.

REs are probably best used to assert negative patterns, i.e. to detect things you know are wrong.

Simple patterns

We will start by learning about the simplest possible REs. Since REs are used to operate on strings, we will begin with the most common task: matching characters.

Characters

Most letters and characters will simply match themselves.
Some characters, however, are special metacharacters and do not match themselves:
- They signal that some out-of-the-ordinary thing should be matched, or
- affect other portions of the RE by repeating them or changing their meaning.

Metacharacters

The complete list of metacharacters:

Metacharacter	Name
`.`	period or dot
`^`	caret
`$`	dollar sign
`*`	asterisk or star
`+`	plus sign
`?`	question mark
`{`	opening curly brace
`}`	closing curly brace
`[`	opening square bracket
`]`	closing square bracket
`\`	backslash
`|`	pipe or bar
`(`	opening parenthesis
`)`	closing parenthesis

Character classes are surrounded by an opening square bracket [ and a closing square bracket ] to form a set of characters. You either specify the characters individually or use ranges by giving a hyphen - in between. Most metacharacters are not active inside character classes; the exceptions are backslash \, a closing square bracket ], a caret ^ in first position, and a hyphen - between two characters. Since the character class is a set, you can also complement it. To do so, give a caret ^ as the first character of the class.
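A short Python sketch of these three forms (individual characters, a range, and a complemented class), using an arbitrary sample word:

```python
import re

individual = re.findall(r"[abc]", "cabbage")   # characters listed one by one
as_range = re.findall(r"[a-c]", "cabbage")     # the same set written as a range
complement = re.findall(r"[^a-c]", "cabbage")  # everything NOT in the set
literal = re.findall(r"[$*]", "a$b*c")         # $ and * are plain characters here

print(individual)  # ['c', 'a', 'b', 'b', 'a']
print(as_range)    # ['c', 'a', 'b', 'b', 'a']
print(complement)  # ['g', 'e']
print(literal)     # ['$', '*']
```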

One of the most important metacharacters is the backslash \ which is used to:

indicate various special sequences
escape all metacharacters so they can be used in patterns without their special meaning, e.g. use \[ to match an actual opening square bracket [ in the string.
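As an illustration of escaping in Python, the sketch below matches a literal bracket and a literal dot. The sample strings are invented; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

text = "items[0] = 3.5"

# \[ matches a literal opening square bracket.
bracket_at = re.search(r"\[", text).start()
print(bracket_at)  # 5

# An unescaped dot matches any character; an escaped dot matches only ".".
loose = re.findall(r"3.5", "3.5 and 345")
strict = re.findall(r"3\.5", "3.5 and 345")
print(loose)   # ['3.5', '345']
print(strict)  # ['3.5']

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape("[0]")
print(escaped)  # \[0\]
```

Writing patterns as raw strings (`r"..."`) keeps Python itself from interpreting the backslashes before the RE engine sees them.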

Some of the special sequences beginning with backslash \ represent predefined shorthand sets of characters that are often useful:

the set of digits,
the set of letters, or
the set of anything that is not whitespace.

\w matches any alphanumeric character. In Python, this set differs depending on whether the RE pattern is:

a string, in which case \w matches all the characters marked as letters or digits in the Unicode data, plus underscore, or
bytes, in which case it is equivalent to the class [a-zA-Z0-9_].

Special sequence	Matches	Restricted¹ equivalent to
`\d`	any decimal digit	`[0-9]`
`\D`	any non-digit character	`[^0-9]`
`\s`	any whitespace character	`[ \t\n\r\f\v]`
`\S`	any non-whitespace character	`[^ \t\n\r\f\v]`
`\w`	any alphanumeric character	`[a-zA-Z0-9_]`
`\W`	any non-alphanumeric character	`[^a-zA-Z0-9_]`

¹ With Python you can use the more restricted definition of e.g. \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression. Otherwise the Unicode character categories are used and thus the sequence sets include a lot more characters.
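The difference between the Unicode and the restricted definition can be seen in a small Python sketch. The sample text is made up and uses escapes for the accented letters so that the code is unambiguous.

```python
import re

text = "caf\u00e9 na\u00efve_1"  # "café naïve_1"

# String pattern, default: Unicode letters and digits count as \w.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['café', 'naïve_1']

# String pattern with re.ASCII: only [a-zA-Z0-9_] counts as \w.
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(ascii_words)  # ['caf', 'na', 've_1']

# Bytes pattern: always the restricted set.
byte_words = re.findall(rb"\w+", b"abc_1 d-e")
print(byte_words)  # [b'abc_1', b'd', b'e']
```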

Sequences can be included inside a character class. E.g. [\s:;] will match any whitespace character, a colon : or semicolon ;.
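A minimal Python sketch of the `[\s:;]` class on an invented sample string:

```python
import re

sample = "a b:c;d"

# Each whitespace character, colon or semicolon is one match.
separators = re.findall(r"[\s:;]", sample)
print(separators)  # [' ', ':', ';']

# Replace every such separator with an underscore.
joined = re.sub(r"[\s:;]", "_", sample)
print(joined)  # a_b_c_d
```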

The final metacharacter in this section is the dot ., which matches anything except a newline character. It is often used where you want to match any character. (In Python you can use re.DOTALL to make it match a newline as well.)
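The effect of `re.DOTALL` can be sketched in Python as follows, using a two-line sample string:

```python
import re

text = "a\nb"

default = re.findall(r".", text)            # newline is skipped
dotall = re.findall(r".", text, re.DOTALL)  # newline is matched too

print(default)  # ['a', 'b']
print(dotall)   # ['a', '\n', 'b']

# "a.b" only matches across the line break when DOTALL is set.
print(re.search(r"a.b", text) is None)                 # True
print(re.search(r"a.b", text, re.DOTALL) is not None)  # True
```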

Repetition

Matching varying sets of characters is the first thing REs can do. Another capability is that you can specify that portions of the RE must be repeated, i.e. qualified, a certain number of times.

All four repeating qualifiers:

* + ? {m,n}

The first single metacharacter for repeating things that we will look at is star *. Star * does not match the literal character *. It specifies that the previous character can be matched zero or more times, instead of exactly once. This means whatever is being repeated may not be present at all.

String	RE	Match
star	`[e]*`	Yes
staar	`t[a]*r`	Yes
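The two rows above can be checked in Python with `re.search`. Note that the first row matches only because star accepts zero repetitions: the match is the empty string at the start of the word. The third call is an extra example, not taken from the table.

```python
import re

empty = re.search(r"[e]*", "star")
print(repr(empty.group()), empty.span())  # '' (0, 0)

run = re.search(r"t[a]*r", "staar")
print(run.group())  # taar

# Zero repetitions are fine as well.
none_run = re.search(r"t[a]*r", "str")
print(none_run.group())  # tr
```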

The second repeating metacharacter is plus +, which matches one or more times. Unlike star *, it requires at least one occurrence.

String	RE	Match
plus	`pl[au]+s`	Yes
pluus	`plu+s`	Yes
plusplus	`uss+`	No
plussusch	`us[cs]+`	Yes
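A small Python loop, written here as a sketch, reproduces the table above and shows which part of each string is matched:

```python
import re

rows = [
    ("plus", r"pl[au]+s"),
    ("pluus", r"plu+s"),
    ("plusplus", r"uss+"),
    ("plussusch", r"us[cs]+"),
]

plus_results = {}
for string, pattern in rows:
    m = re.search(pattern, string)
    # Store the matched text, or None when there is no match.
    plus_results[string] = m.group() if m else None
    print(string, pattern, plus_results[string])
```

The third row fails because `s+` needs at least one more s after "us", and "plusplus" never contains "uss".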

The third single repeating qualifier is the question mark ?, which matches either once or zero times.

String	RE	Match
question	`qu?e`	Yes
question	`est?s`	No
markka	`rk?a`	No
mark	`a?r`	Yes
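The optional character can be tried out in Python. The spelling example with `colou?r` is an added illustration, not part of the table above.

```python
import re

# The u is optional, so both spellings match; "colouur" does not.
spellings = re.findall(r"colou?r", "color colour colouur")
print(spellings)  # ['color', 'colour']

optional_results = {}
for string, pattern in [("question", r"qu?e"), ("markka", r"rk?a"), ("mark", r"a?r")]:
    m = re.search(pattern, string)
    optional_results[string] = m.group() if m else None

print(optional_results)  # {'question': 'que', 'markka': None, 'mark': 'ar'}
```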

The fourth and most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n repetitions.

String	RE	Match
complicated	`li{1,1}c`	Yes
appreciated	`p{2,2}`	Yes
rain	`[ai]{2,2}`	Yes
rain	`[ai]{1,2}`	Yes
complicated	`li[act]{3,}ed`	Yes

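If either m or n is omitted, that bound is left open: {3,} means three or more repetitions, and {,3} means up to three. (Python accepts the {,3} form; some other engines require the lower bound to be written out, as in {0,3}.)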

With this qualifier you can express all the single repeating qualifiers, e.g. ? as {0,1}, + as {1,}, and * as {0,}, but the single versions are both easier on the eye and shorter to write.
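The equivalences can be verified with a short Python sketch. The sample strings are arbitrary, and `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

samples = ["", "a", "aa", "aaaa"]
pairs = [(r"a?", r"a{0,1}"), (r"a+", r"a{1,}"), (r"a*", r"a{0,}")]

# Each single qualifier accepts exactly the same strings as its {m,n} form.
for single, braced in pairs:
    for s in samples:
        assert bool(re.fullmatch(single, s)) == bool(re.fullmatch(braced, s))

# An omitted lower bound (Python syntax): up to three repetitions.
up_to_three = [s for s in samples if re.fullmatch(r"a{,3}", s)]
print(up_to_three)  # ['', 'a', 'aa']

# A row from the table above: three or more of a, c or t between "li" and "ed".
licated = re.search(r"li[act]{3,}ed", "complicated").group()
print(licated)  # licated
```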

Anchors

Anchors match a position, not characters.

Metacharacter anchors

Metacharacter	Matches at
`^`	the start of a string
`$`	the end of a string

Most RE engines have a multi-line mode that makes caret ^ match after any line break, and dollar sign $ before any line break.

String	RE	Match
complicated	`^comp`	Yes
appreciated	`ed$`	Yes
rain	`^rain$`	Yes
rain	`^r[ai]+n$`	Yes
complicated	`^comp.*ed$`	Yes
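In Python, the multi-line mode is switched on with the `re.MULTILINE` flag. The following sketch uses a made-up two-line string to show how the anchors behave with and without it.

```python
import re

text = "rain\nsun"

start_default = re.findall(r"^\w+", text)              # only the start of the string
start_multi = re.findall(r"^\w+", text, re.MULTILINE)  # the start of every line
end_default = re.findall(r"\w+$", text)                # only the end of the string
end_multi = re.findall(r"\w+$", text, re.MULTILINE)    # the end of every line

print(start_default, start_multi)  # ['rain'] ['rain', 'sun']
print(end_default, end_multi)      # ['sun'] ['rain', 'sun']

# Anchors match positions: "ed" occurs in "editor", but not at its end.
print(re.search(r"ed$", "editor") is None)  # True
```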

Special sequence anchors

Special sequence	Matches at
`\b`	a word boundary
`\B`	not a word boundary

A word boundary is a position between a character that can be matched by the set of characters of \w and a character that cannot be matched by \w. \b also matches at the ends of the string if the first/last characters in the string are word characters. \B matches at every position where \b cannot match.

String	RE	Match
complicated	`\bcomp`	Yes
appreciated	`\Bed\b`	Yes
rain	`\brain\b`	Yes
rain	`^r[ai]+n\b`	Yes
complicated	`\bcomp.+\b`	Yes
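Word boundaries are easiest to see on a string with several words. This Python sketch uses invented sample strings; the patterns are raw strings because in an ordinary Python string `"\b"` would mean a backspace character.

```python
import re

# Only the standalone word matches, not "brain" or "rainy".
whole_words = re.findall(r"\brain\b", "rain, brain, rainy")
print(whole_words)  # ['rain']

# \B requires that "ed" does NOT start a word; \b requires that it ends one.
endings = [m.span() for m in re.finditer(r"\Bed\b", "ed red edit")]
print(endings)  # [(4, 6)], i.e. the "ed" in "red"

# "comp" is inside the word here, so there is no boundary before it.
print(re.search(r"\bcomp", "uncomplicated") is None)  # True
```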