Regular expressions (regex)

catregex

Regular expressions, commonly referred to as regex, are an enhanced version of glob-patterns and are available in most, if not all, programming languages.

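🎯 Most of what was learned in glob-patterns still applies to regexes (character sets such as [abc] or [a-z], for instance), so it won't be covered again. ⚠️ The symbol for "one character" is now . (dot) and not ? (question mark), which was given a new meaning. Likewise, * no longer stands for "any string" on its own: it repeats the preceding element zero or more times, so "any string" is written .* in a regex.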
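To make the difference between the two notations concrete, here is a minimal sketch comparing a shell glob with a regex. The sample words and variable names are made up for this illustration.

```shell
# Glob: ? stands for exactly one character, so a?c matches abc.
case abc in
  a?c) glob_result=match ;;
  *)   glob_result=no-match ;;
esac

# Regex: . stands for exactly one character, so ^a.c$ keeps abc but not ac.
regex_dot=$(printf '%s\n' abc ac | grep '^a.c$')

# Regex: ? makes the preceding character optional, so ^ab?c$ keeps abc and ac.
regex_optional=$(printf '%s\n' abc ac abbc | grep -E '^ab?c$')

printf '%s\n' "$glob_result" "$regex_dot" "$regex_optional"
```

The -E option asks grep for the extended syntax, in which ? is a quantifier without needing a backslash.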

Practice 🧪


Enhancements

Some new symbols were introduced:

Symbol      Description
x?          an optional character x (zero or one occurrence)
x+          one or more occurrences of x
^x          lines starting with x
x$          lines ending with x
(x|y)       either x or y
x{n,m}      at least n and at most m occurrences of x
x{n,}       at least n occurrences of x (no upper limit)
x{,m}       at most m occurrences of x (no lower limit)
x{n}        exactly n occurrences of x
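The symbols above can be tried out with grep -E, which uses the extended syntax. This is only a sketch: the word list and the variable names are invented for the example.

```shell
# Sample words, one per line (made up for this illustration).
words=$(printf '%s\n' color colour colouur ac abc abbc abbbbc)

# u? : the u is optional, so color and colour are kept.
optional=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# b+ : one or more b, so abc, abbc and abbbbc are kept.
one_or_more=$(printf '%s\n' "$words" | grep -E '^ab+c$')

# b{2,3} : two or three b, so only abbc is kept.
bounded=$(printf '%s\n' "$words" | grep -E '^ab{2,3}c$')

# (ac|abc) : either alternative, so ac and abc are kept.
either=$(printf '%s\n' "$words" | grep -E '^(ac|abc)$')

printf '%s\n' "$optional" "$one_or_more" "$bounded" "$either"
```

The ^ and $ anchors force each pattern to cover the whole line; without them, grep would also keep lines that merely contain a match.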

Regex introduced metacharacters which are shortcuts to these charsets:

  • \d which is [0-9]
  • \D which is [^0-9]
  • \w which is [a-zA-Z0-9_]
  • \W which is [^a-zA-Z0-9_]
  • \s which is [ \t\n\r\f\v] (the set starts with a space)
  • \S which is [^ \t\n\r\f\v]
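Support for these shorthand forms varies between tools (POSIX grep and sed, for instance, do not define \d), so the sketch below spells out the expanded charsets instead, which should behave the same wherever bracket expressions are available. The sample input and variable names are hypothetical.

```shell
# Hypothetical sample input, one item per line.
sample=$(printf '%s\n' 'room 42' 'user_name' '2024' 'a-b')

# [0-9] is the expansion of \d: keep lines made only of digits.
digits_only=$(printf '%s\n' "$sample" | grep -E '^[0-9]+$')

# [a-zA-Z0-9_] is the expansion of \w: keep lines made only of word characters.
words_only=$(printf '%s\n' "$sample" | grep -E '^[a-zA-Z0-9_]+$')

# [^a-zA-Z0-9_] is the expansion of \W: keep lines containing a non-word character.
has_non_word=$(printf '%s\n' "$sample" | grep -E '[^a-zA-Z0-9_]')

printf '%s\n' "$digits_only" "$words_only" "$has_non_word"
```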

You can apply a symbol to a group by wrapping it in parentheses. For instance, (ab)+(cd|e)? matches one or more repetitions of ab, optionally followed by either cd or e, such as ab, ababcd, or abe.
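As a quick check of the grouped pattern, the sketch below runs it against a few made-up candidates with grep -E, anchoring it so that the whole line has to match.

```shell
# Made-up candidates, one per line.
candidates=$(printf '%s\n' ab abab ababcd abe abcdcd cd)

# (ab)+ needs at least one "ab"; (cd|e)? allows one optional "cd" or "e" after it.
matched=$(printf '%s\n' "$candidates" | grep -E '^(ab)+(cd|e)?$')

printf '%s\n' "$matched"
```

Here abcdcd is rejected because ? allows the (cd|e) group at most once, and cd is rejected because (ab)+ requires at least one ab.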

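Many languages and commands allow us to use capture groups 🚩. In the basic syntax, which sed and grep use by default, these are groups wrapped in \( and \); in the extended and Perl-style syntaxes, plain ( and ) capture. A capture group lets us extract the part of the text that matched it, which is useful when a regex is used to pull data out of its input.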

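For instance, the following regex matches a line with at least three characters: exactly three characters (.{3}) followed by a possibly empty string (.*). To know what the three characters are, we can wrap them in a capture group. It is written here in the basic syntax, where both the group and the braces are escaped:

^\(.\{3\}\).*$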

How a captured group is read back depends on the tool or language. It may be $1 or \1, or it could be an element of an array returned by a function.
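As an illustration of the \1 form only, here is a minimal sketch with sed that swaps the two sides of a hypothetical key=value pair, once in the basic syntax and once in the extended syntax (sed -E, supported by GNU and BSD sed). The input and variable names are assumptions made for the example.

```shell
# Hypothetical input of the form key=value.
pair='name=alice'

# Basic syntax (sed default): groups are \( and \), read back with \1 and \2.
swapped_basic=$(printf '%s\n' "$pair" | sed 's/^\(.*\)=\(.*\)$/\2=\1/')

# Extended syntax (sed -E): groups are plain ( and ), still read back with \1 and \2.
swapped_extended=$(printf '%s\n' "$pair" | sed -E 's/^(.*)=(.*)$/\2=\1/')

printf '%s\n' "$swapped_basic" "$swapped_extended"
```

Both commands print alice=name; only the way the groups are written differs.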

$ echo -e "abc\nde\nfeghi\nklmnop" > example
$ cat example
abc
de
feghi
klmnop
# replace the match with the first 3 followed by "..."
$ sed 's/^\(.\{3\}\).*$/\1.../g' example
abc...
de
feg...
klm...