Regular expressions (regex)

Regular expressions, commonly referred to as regex, are an enhanced version of Unix glob patterns. Applications commonly use them to locate and/or extract a specific pattern in a string.

Wildcards   Description
x           matches the character 'x'
.           matches any single character
[abc]       matches one character that is either a, b, or c
[^abc]      matches one character that is neither a, b, nor c
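
As a quick illustration of the table above, the sketch below checks each basic pattern with Python's re module. Python is only an assumed engine here, and the sample strings are made up for the example.

```python
import re

# Each entry: (pattern, text, whether the whole text is expected to match)
checks = [
    ("x", "x", True),        # literal character
    ("x", "y", False),
    (".", "q", True),        # any single character...
    (".", "\n", False),      # ...except a newline, by default in Python
    ("[abc]", "b", True),    # one of a, b, or c
    ("[abc]", "d", False),
    ("[^abc]", "d", True),   # any character except a, b, and c
    ("[^abc]", "a", False),
]

# re.fullmatch only succeeds if the pattern covers the entire string
results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]

for (pattern, text, expected), got in zip(checks, results):
    print(f"{pattern!r} against {text!r}: {got} (expected {expected})")
```

Note that whether . matches a newline depends on the engine and its flags, so that row may differ in other tools.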

Some symbols apply to the regex that precedes them. For instance, (a|b)? means "optionally, a or b".

Symbol        Description
?             optionally matches the preceding element
*             matches the preceding element 0 or more times
+             matches the preceding element 1 or more times
^regex        matches lines starting with the regex that follows
regex$        matches lines ending with the preceding regex
(r1|r2|...)   matches one of the provided regexes
r{n,m}        matches the preceding regex at least $n$ and at most $m$ times
r{n,}         matches the preceding regex at least $n$ times (no maximum)
r{,m}         matches the preceding regex at most $m$ times (no minimum)
r{n}          matches the preceding regex exactly $n$ times
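
The symbols above can be tried out with a small sketch, again assuming Python's re module as the engine. The helper name matches and the test strings are hypothetical, chosen only for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# (a|b)? : optionally "a or b", followed by a mandatory "c"
optional = [matches(r"^(a|b)?c$", s) for s in ("c", "ac", "bc", "abc")]

# * : zero or more "b" after "a";  + : at least one "b" after "a"
star = [matches(r"^ab*$", s) for s in ("a", "abbb", "b")]
plus = [matches(r"^ab+$", s) for s in ("a", "ab", "abbb")]

# {n,m}, {n}, {n,} and {,m} : bounded repetition of "a"
bounded = [matches(r"^a{2,3}$", s) for s in ("a", "aa", "aaa", "aaaa")]
exact = [matches(r"^a{2}$", s) for s in ("a", "aa", "aaa")]
at_least = matches(r"^a{2,}$", "aaaaa")
at_most = [matches(r"^a{,2}$", s) for s in ("", "aa", "aaa")]

print(optional, star, plus, bounded, exact, at_least, at_most)
```

The r{,m} form is accepted by Python but not by every engine, so writing r{0,m} is the more portable choice.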

Regex introduced shortcuts for common charsets, usually called shorthand character classes:

  • \d which is [0-9]
  • \D which is [^0-9]
  • \w which is [a-zA-Z0-9_]
  • \W which is [^a-zA-Z0-9_]
  • \s which is [ \t\n\r\f\v] (note the leading space)
  • \S which is [^ \t\n\r\f\v]
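
A minimal sketch of these shorthands, assuming Python's re module. The sample string is invented for the example. The re.ASCII flag is passed because Python otherwise extends \d, \w and \s to Unicode characters, which goes beyond the ASCII sets listed above.

```python
import re

text = "user_42 logged in\tat 09:15"

# \d+ : runs of digits
digits = re.findall(r"\d+", text, flags=re.ASCII)

# \w+ : runs of letters, digits and underscores
words = re.findall(r"\w+", text, flags=re.ASCII)

# \S+ : runs of non-whitespace characters
non_space = re.findall(r"\S+", text, flags=re.ASCII)

# \s+ : split on runs of whitespace (space or tab here)
parts = re.split(r"\s+", text, flags=re.ASCII)

# \D : remove everything that is not a digit
only_digits = re.sub(r"\D", "", text, flags=re.ASCII)

print(digits, words, non_space, parts, only_digits)
```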

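Regexes also introduced the notion of capture groups to separately extract a part of the match. These groups are wrapped in \( and \) in POSIX basic regexes (the default in sed and grep), and in plain ( and ) in extended regexes and most other engines.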

-^.\{3\}.*$
+^\(.\{3\}\).*$

How a capture group is retrieved is implementation-specific. It may be referenced as $1 or \1, or returned in an array-like structure, as in Python.

$ cat example
abc
de
feghi
# match lines with >=3 chars and replace the tail with "..."
$ sed 's/^\(.\{3\}\).*$/\1.../g' example
abc...
de
feg...
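
For comparison, here is a rough Python counterpart of the sed command above. It is a sketch rather than an exact equivalent: Python uses plain parentheses for groups, \1 in the replacement string, and a match object to read groups back. The variable names are illustrative.

```python
import re

lines = ["abc", "de", "feghi"]

# Capture the first three characters; .* consumes the rest of the line
pattern = re.compile(r"^(.{3}).*$")

# \1 in the replacement refers to the first capture group
truncated = [pattern.sub(r"\1...", line) for line in lines]
print(truncated)

# Groups can also be read back from the match object
m = pattern.match("feghi")
first_three = m.group(1)   # text captured by the first group
all_groups = m.groups()    # tuple holding every captured group
print(first_three, all_groups)
```

Lines shorter than three characters do not match at all, so they are left unchanged, as in the sed output.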

Practice 🧪