Regular Expressions
Regular expressions (called REs, regexes, regexps, regex patterns) are essentially a tiny, highly specialized programming language embedded inside general purpose programming languages (Python, XQuery, javascript). Please note that there are differences in how all general purpose programming languages implement their RE language. When using a RE language, you specify the rules for the set of possible strings that you want to match:
- sentences in a language,
- e-mail addresses,
- TeX commands, or
- anything you like.
It does not replace a parser for XML or HTML since you can easily create invalid and non-wellformed markup.
You can then ask questions such as:
- Does this string match the pattern?, or
- Is there a match for the pattern anywhere in this string?.
You can also use REs to modify a string or to split it apart in various ways.
RE patterns are usually compiled into a series of bytecodes which are then executed by a matching engine. (For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster or just not use too much resources to be useful).
Since a RE language is relatively small and restricted not all possible string processing tasks can be done using REs. There are also tasks that can be done with REs, but the expressions turn out to be very complicated.
In these cases, you may be better off writing code in the programming language, e.g. Python, to do the processing; Usually it is slower than an elaborate RE but probably a lot easier to understand.
Your best use would probably be to assert negative patterns, e.g. things you know are wrong.
Simple patterns
We will start by learning about the simplest possible REs. Since REs are used to operate on strings, we will begin with the most common task: matching characters.
Characters
- Most letters and characters will simply match themselves.
- Though some characters are special metacharacters, and do not match themselves:
- They signal that some out-of-the-ordinary thing should be matched, or
- affect other portions of the RE by repeating them or changing their meaning.
Metacharacters
The complete list of metacharacters:
Metacharacter | Name |
---|---|
. |
period or dot |
^ |
caret |
$ |
dollar sign |
* |
asterisk or star |
+ |
plus sign |
? |
question mark |
{ |
opening curly brace |
} |
closing curly brace |
[ |
opening square bracket |
] |
closing square bracket |
\ |
backslash |
| |
pipe or bar |
( |
opening parenthesis |
) |
closing parenthesis |
Character classes are surrounded by opening square bracket [
and closing square bracket ]
to form a set of characters. Either you specify the characters individually or use ranges by giving a hyphen -
inbetween. Metacharacters are not active inside character classes. Since the character class is a set you can also complement it. To do complementing you give a caret ^
as the first character of the class.
One of the most important metacharacters is the backslash \
which is used to:
- indicate various special sequences
- escape all metacharacters so they can be used in patterns without their special mening, e.g. use
\[
to match an actual opening square bracket[
in the string.
Some of the special sequences beginning with backslash \
represent predefined shorthand sets of characters that are often useful:
- the set of digits,
- the set of letters, or
- the set of anything that is not whitespace.
\w
matches any alphanumeric character. For use with Python this set differs depending on whether the RE pattern is:
- a string,
\w
will match all the characters marked as letters or digits in the Unicode data plus underscore, or - bytes, then this is equivalent to the class
[a-zA-Z0-9_]
.
Special sequence | Matches | Restricted1 equivalent to |
---|---|---|
\d |
any decimal digit | [0-9] |
\D |
any non-digit character | [^0-9] |
\s |
any whitespace character | [ \t\n\r\f\v] |
\S |
any non-whitespace character | [^ \t\n\r\f\v] |
\w |
any alphanumeric character | [a-zA-Z0-9_] |
\W |
any non-alphanumeric character | [^a-zA-Z0-9_] |
1 With Python you can use the more restricted definition of e.g. \w
in a string pattern by supplying the re.ASCII
flag when compiling the regular expression. Otherwise the Unicode character categories are used and thus the sequence sets include a lot more characters.
Sequences can be included inside a character class. E.g. [\s:;]
will match any whitespace character, a colon :
or semicolon ;
.
The final metacharacter in this section is dot .
. It matches anything except a newline character.
(In Python you can use re.DOTALL
to match even a newline. Dot .
is often used where you want to match any character.)
Repetition
Matching varying sets of characters is the first thing REs can do. Another capability is that you can specify that portions of the RE must be repeated, i.e. qualified, a certain number of times.
All four repeating qualifiers:
* + ? {m,n}
The first single metacharacter for repeating things that we will look at is star *
. Star *
does not match the literal character *
. It specifies that the previous character can be matched zero or more times, instead of exactly once. This means whatever is being repeated may not be present at all.
String | RE | Match |
---|---|---|
star | [e]* |
Yes |
staar | t[a]*r |
Yes |
The second repeating metacharacter is plus +
which matches one or more times. This requires at least one occurrence compared to asterisk *
.
String | RE | Match |
---|---|---|
plus | pl[au]+s |
Yes |
pluus | plu+s |
Yes |
plusplus | uss+ |
No |
plussusch | us[cs]+ |
Yes |
The third single repeating qualifiers is the question mark ?
which matches either once or zero times.
String | RE | Match |
---|---|---|
question | qu?e |
Yes |
question | est?s |
Yes |
markka | rk?a |
No |
mark | a?r |
Yes |
The fourth and most complicated repeated qualifier is {m,n}
, where m
and n
are decimal integers. This qualifier means there must be at least m
repetitions, and at most n
repetitions.
String | RE | Match |
---|---|---|
complicated | li{1,1}c |
Yes |
appreciated | p{2,2} |
Yes |
rain | [ai]{2,2} |
Yes |
rain | [ai]{1,2} |
Yes |
complicated | li[act]{3,}ed |
Yes |
If either m
or n
is omitted it becomes for e.g. {3,}
three or more and {,3}
up to three repetitions.
With this qualifier you can express all the single repeating qualifiers, e.g. ?
as {0,1}
+
as {1,}
, and *
as {0,}
but the single versions are both easier on the eye and shorter to write.
Anchors
Anchors match a position not characters.
Metacharacter anchors
Metacharacter | Matches at |
---|---|
^ |
the start of a string |
$ |
the end of a string |
Most RE engines have a multi-line mode that makes caret ^
match after any line break, and dollar_sign $
before any line break.
String | RE | Match |
---|---|---|
complicated | ^comp |
Yes |
appreciated | ed$ |
Yes |
rain | ^rain$ |
Yes |
rain | $r[ai]+n$ |
Yes |
complicated | ^comp.*ed$ |
Yes |
Special sequence anchors
Special sequence | Matches at |
---|---|
\b |
a word boundary |
\B |
not a word boundary |
A word boundary is a position between a character that can be matched by the set of characters of \w
and a character that cannot be matched by \w
. \b
also matches at the ends of the string if the first/last characters in the string are word characters. \B
matches at every position where \b
cannot match.
String | RE | Match |
---|---|---|
complicated | \bcomp |
Yes |
appreciated | \Bed\b |
Yes |
rain | \brain\b |
Yes |
rain | $r[ai]+n\b |
Yes |
complicated | \bcomp.+\b |
Yes |