Regex Isn't Hard
Regex gets a bad reputation for being very complex. That’s fair, but I also think that if you focus on a certain core subset of regex, it’s not that hard. Most of the complexity comes from various “shortcuts” that are hard to remember. If you ignore those, the language itself is fairly small and portable across programming languages.
It’s worth knowing regex because you can get A LOT done in very little code. If I try to replicate what my regex does using normal procedural code, it’s often very verbose, buggy and significantly slower. It often takes hours or days to do better than a couple minutes of writing regex.
NOTE: Some languages, like Rust, have parser combinators which can be as good as or better than regex in most of the ways I care about. However, I often opt for regex anyway because it’s less to fit in my brain. There’s a single core subset of regex that all major programming languages support.
There are four major concepts you need to know:
- Character sets
- Repetition
- Groups
- The |, ^ and $ operators
Here I’ll highlight a subset of the regex language that’s not hard to understand or remember. Throughout I’ll also tell you what to ignore. Most of these things are shortcuts that save a little verbosity at the expense of a lot of complexity. I’d rather verbosity than complexity, so I stick to this subset.
Character Sets
A character set is the smallest unit of text matching available in regex. It’s just one character.
Single characters
a matches a single character, always lowercase a. aaa is 3 consecutive character sets, each of which matches only a. Same with abc, but the second and third match b and c respectively.
Ranges
Match one of a set of characters.
- [a]: same as just a
- [abc]: matches a, b, or c
- [a-c]: same, but using - to specify a range of characters
- [a-z]: any lowercase character
- [a-zA-Z]: any lowercase or uppercase character
- [a-zA-Z0-9!@#$%^&*()-]: alphanumeric plus any of these symbols: !@#$%^&*()-
Note in that last point how - comes last, so it can’t be read as a range. Also note that ^ isn’t the first character in the set: the ^ can become an operator if it occurs as the first character in a character set or regex.
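To make the list above concrete, here is a minimal sketch using Python's re module. The sample strings are made up for illustration, and other languages should behave the same way for this subset.

```python
import re

# Each character set consumes exactly one character of the input.
sequence = re.search("abc", "xxabcxx").group(0)    # a, then b, then c
one_of = re.search("[abc]", "xxbxx").group(0)      # any one of a, b, c
as_range = re.search("[a-c]", "xxcxx").group(0)    # same set, written as a range

# The literal - sits last and ^ is kept away from the first position,
# so neither of them is treated as an operator.
symbols = "[a-zA-Z0-9!@#$%^&*()-]"
matched = re.findall(symbols, "a-b ^c")            # the space is not in the set

print(sequence, one_of, as_range, matched)
```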
There’s a parallel to boolean logic here:
- ab means “a AND b”
- [ab] means “a OR b”
You can build more complex logic using groups and negation.
Negation (^)
I mention this operator later, but in the context of character sets, it means “everything but these”.
Example:
- [^ab] means “everything but a or b”
- [ab^] means “a, b or ^”. The ^ has to be the first character to have special meaning.
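A quick way to check the two cases is to run both sets over the same input. This is a small sketch in Python, and the input string is arbitrary.

```python
import re

text = "abcab^"

# ^ first in the set: match any single character except a or b.
not_ab = re.findall("[^ab]", text)

# ^ anywhere else in the set: just a literal caret.
with_caret = re.findall("[ab^]", text)

print(not_ab)      # ['c', '^']
print(with_caret)  # ['a', 'b', 'a', 'b', '^']
```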
[Ignore this stuff]
These things are unnecessarily complex. They save some verbosity at the expense of a lot of complexity.
- \w, \s, etc.: These are shortcuts for ranges like [a-zA-Z0-9]. Ignore them because they’re not portable. Most programming languages have them to some extent, but they’re hard to remember. Some languages use different syntax, like :word:, which is almost as long as writing it out explicitly.
- The dot (.): it matches any character, but not always. Sometimes it doesn’t match newlines. In some programming languages it never matches newlines. I’ve been bitten too often by the . not behaving like I think it should. It’s best to ignore this entirely. Instead, use a range negation, like [^%] if you know the % character won’t show up. It doesn’t hurt to be a little more explicit.
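The newline surprise is easy to reproduce. The sketch below uses Python, where the dot skips newlines unless the re.DOTALL flag is passed. The sample text and the choice of % as a delimiter are assumptions for illustration.

```python
import re

text = "key=%first\nline%"

# By default Python's . does not match a newline, so this finds nothing.
dot_default = re.search("%.*%", text)

# The same pattern only works once the DOTALL flag is remembered.
dot_all = re.search("%.*%", text, re.DOTALL)

# A negated set says what it means: anything except %, newlines included.
explicit = re.search("%[^%]*%", text)

print(dot_default)                  # None
print(repr(explicit.group(0)))      # '%first\nline%'
```

The negated set needs no flag, so the pattern reads the same in any host language.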
Repetition
These operators change the immediately previous character set to match a certain number of times:
- ? means zero or one
- * means zero or more
- + means one or more
All of these work on entire groups as well.
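The three operators can be compared side by side. In this Python sketch, matches is a hypothetical helper (not part of any library) that checks whether a whole string fits a pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : zero or one minus sign before a digit
optional = [matches("-?[0-9]", s) for s in ["5", "-5", "--5"]]

# * : zero or more digits after the a
zero_or_more = [matches("a[0-9]*", s) for s in ["a", "a1", "a123"]]

# + : one or more digits after the a
one_or_more = [matches("a[0-9]+", s) for s in ["a", "a1", "a123"]]

# The same operators applied to a whole group
group_repeat = matches("(ab)+", "ababab")

print(optional, zero_or_more, one_or_more, group_repeat)
```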
[Ignore this stuff]
These are unnecessarily complex. You can accomplish the same thing through other means.
- Non-greedy matching, *? and +?. This comes up a lot when you use the . character set. Instead, you can usually use a stricter negation character set like [^%].
- Repetition ranges, e.g. {1,2}. Just duplicate your pattern or use ? or * on the group.
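Both workarounds can be sketched in a few lines of Python. The %-delimited sample text is an assumption chosen to match the [^%] example.

```python
import re

text = "%one% and %two%"

# Greedy . runs to the last %, swallowing everything in between.
greedy = re.findall("%.*%", text)

# The non-greedy form fixes that, but so does a negated set.
lazy = re.findall("%.*?%", text)
negated = re.findall("%[^%]*%", text)

# {1,2} spelled out: one required digit plus one optional digit.
with_range = [re.fullmatch("[0-9]{1,2}", s) is not None for s in ["4", "42", "421"]]
spelled_out = [re.fullmatch("[0-9][0-9]?", s) is not None for s in ["4", "42", "421"]]

print(greedy)   # ['%one% and %two%']
print(negated)  # ['%one%', '%two%']
```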
Groups
A group is basically a sub-regex. There are three common uses for groups:
1. Repeat a sub-pattern
e.g. the pattern ([0-9][0-9]?[0-9]?[.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This would match an IP address (albeit not strictly).
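Here is how that pattern behaves in Python. The loose_ip pattern is my own extension for illustration: it appends a final run of digits so the last part of an address is covered too.

```python
import re

# Group repeated with +: one to three digits followed by a literal dot.
octet_dot = "([0-9][0-9]?[0-9]?[.])+"
prefix = re.search(octet_dot, "192.168.1.1").group(0)

# Hypothetical extension (not from the text above): a trailing run of digits.
loose_ip = octet_dot + "[0-9][0-9]?[0-9]?"
whole = re.fullmatch(loose_ip, "192.168.1.1") is not None

# "Not strictly": nothing limits the number of parts or their values.
too_loose = re.fullmatch(loose_ip, "999.1.2.3.4.5") is not None

print(prefix)            # 192.168.1.
print(whole, too_loose)  # True True
```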
2. Substitutions
The most common regex operations are match and substitute. However, the API for substitution varies quite a bit depending on the host language.
- Methods: in C#, Java, Python, etc. there’s typically a method or function named something like sub, substitute or replace.
- sed style: in sed, Perl, and bash it flows like s/pattern/replacement/, where the leading s means “substitute”.
In both cases you can use $1 or \1 in the replacement to refer to the text captured by the first group. Look up in the docs which one is appropriate.
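As one instance of the method style, Python's re.sub takes the pattern, the replacement and the input, and spells group references as \1 and \2. The name-swapping example below is made up for illustration.

```python
import re

# Swap "last, first" into "first last" by referring to the two groups.
swapped = re.sub("([a-zA-Z]+), ([a-zA-Z]+)", "\\2 \\1", "Lovelace, Ada")

# In Python, $1 has no special meaning in the replacement: it is inserted as-is.
literal_dollar = re.sub("(a)", "$1", "a")

print(swapped)         # Ada Lovelace
print(literal_dollar)  # $1
```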
3. Extract text
You can extract the text that the group matches.
- 0: the entire regex match
- 1-∞: the text matched by the 1-indexed group. The first set of parentheses is group 1, the second is 2, etc.
The non-portable part is that the API for accessing groups is almost always different in every programming language. Still, group extraction is extremely useful, so just look it up.
The most common APIs look like:
- Match.group(1): Python, C#, Java, etc. offer a method from the main programming language to extract a group from a match object. The exact method name is usually something like group or getGroup.
- $1: Perl will set variables like $1 and $2 in the local scope. Most programming languages can’t do this, but you’ll see the syntax come up, e.g. with replacements you can often use either $1 or \1 in the substitution text.
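In Python the match-object style looks roughly like this. The email-like pattern and input are assumptions for illustration and far too simple for real addresses.

```python
import re

match = re.search("([a-z]+)@([a-z]+)[.]com", "mail bob@example.com today")

whole = match.group(0)   # group 0: the entire regex match
user = match.group(1)    # first set of parentheses
domain = match.group(2)  # second set of parentheses

print(whole, user, domain)  # bob@example.com bob example
```

Note that re.search returns None when nothing matches, so real code should check the result before calling group.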
If those APIs don’t exist, or if you don’t feel like remembering them, you can replicate extraction via substitution. For example, in Python you can do re.sub("([^\n]*\\.foo)[^\n]*", "\\1", input_str) to extract the first group. (Python’s replacement syntax is \1, not $1.)
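A runnable version of that trick might look like the following sketch. The file-name inputs are hypothetical. The pattern has to cover the whole line so that everything outside the group is replaced away.

```python
import re

pattern = "([^\n]*\\.foo)[^\n]*"

# Everything the pattern matched is replaced by what group 1 captured.
extracted = re.sub(pattern, "\\1", "path/to/file.foo.bak")

# Caveat: when the pattern does not match, the input comes back unchanged,
# so this approach cannot tell "no match" apart from a real extraction.
unchanged = re.sub(pattern, "\\1", "nothing here")

print(extracted)  # path/to/file.foo
print(unchanged)  # nothing here
```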
[Ignore this stuff]
There are some operators at the beginning of groups, like (?: that can mean various things like “non-capturing group” or “look-ahead” or “look-behind”. These are fairly advanced and you can generally get away without knowing about them.
The |, ^ and $ Operators
The | operator is OR, but for entire regex or groups.
- foo|bar matches either foo or bar
- (foo|bar)+ adds some repetition on it, e.g. it matches barfoobarfoo
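The difference between | on a whole regex and | inside a group shows up quickly in practice. In this Python sketch the gray/grey example is my own addition.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

either = [matches("foo|bar", s) for s in ["foo", "bar", "foobar"]]
repeated = matches("(foo|bar)+", "barfoobarfoo")

# Without a group, | splits the entire pattern: this means "gra" OR "ey".
ungrouped = matches("gra|ey", "grey")
# With a group, only the one character alternates.
grouped = matches("gr(a|e)y", "grey")

print(either, repeated, ungrouped, grouped)
```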
The ^ is only ever significant when it’s the first character:
- First in the pattern: match starting at the beginning of the string or line. e.g. ^foo will match foobar but not barfoo.
  - WARNING: Some regex APIs always behave as if the pattern is surrounded by ^ and $. You can test for this pretty easily with trial and error.
- First in character set: negation, match everything but those characters
The $ character only ever means “the end” and it’s only used in top-level regex.
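Python happens to illustrate both the anchors and the warning about implicit anchoring, because its re module offers three entry points that differ only in how they anchor. A minimal sketch:

```python
import re

# Explicit anchors with re.search, which adds no anchoring of its own.
start_ok = re.search("^foo", "foobar") is not None
start_no = re.search("^foo", "barfoo") is not None
end_ok = re.search("foo$", "barfoo") is not None

# Implicit anchors: re.match only matches at the start of the string,
# and re.fullmatch requires the pattern to cover the whole string.
match_fails = re.match("foo", "barfoo") is None
fullmatch_fails = re.fullmatch("foo", "foobar") is None
search_finds = re.search("foo", "barfoo") is not None

print(start_ok, start_no, end_ok)  # True False True
```

This is the kind of trial and error that reveals how a given API anchors its patterns.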
Conclusion
It’s not a bad idea to stick to this subset of regex because it’s mostly portable across programming languages. That means fewer things to remember, so you get a lot of “bang for the buck” in terms of jamming info into your brain. The quirks that do exist are relatively few, and are usually worth the effort because of the value they provide.
Regarding portability: most modern implementations try to copy some subset of Perl regex. The subset I’ve outlined here is pretty consistent across the major programming languages of today. However, you might run into some surprises if you’re using old tools like sed and grep, which predate Perl and default to an older regex dialect (in their default mode + and | are not operators; the -E flag enables them). Newer implementations are reasonably stable though.
Too often people entirely reject regex, which is a shame because it’s an incredibly powerful language for text processing. A little bit of regex knowledge goes a very long way. I hope this helps!