Regular expressions (regex)
Regular expressions, commonly referred to as regex, can be thought of as a more expressive counterpart to Unix glob patterns. Applications commonly use them to locate and/or extract a specific pattern in a string.
Wildcards | Description |
---|---|
x | matches the character 'x' |
. | matches any single character (in most implementations, except a newline) |
[abc] | one character which is either a, b, or c. |
[^abc] | any character which is not a, nor b, nor c. |
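
As a minimal sketch of the wildcards above, the following uses Python's `re` module. The sample strings and variable names are made up for illustration, and the note about newlines reflects Python's default behaviour, which other implementations may not share.

```python
import re

# "." matches any single character except a newline (Python's default).
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")

# [abc] matches exactly one of a, b, or c.
in_set = re.findall(r"[abc]", "abcdef")

# [^abc] matches exactly one character that is none of a, b, c.
not_in_set = re.findall(r"[^abc]", "abcdef")

print(dot_matches)  # ['cat', 'cot', 'cut']
print(in_set)       # ['a', 'b', 'c']
print(not_in_set)   # ['d', 'e', 'f']
```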
Some symbols apply to the preceding regex. For instance, (a|b)? means "optionally a or b".
Symbol | Description |
---|---|
? | optionally matches the preceding element |
* | matches the preceding element 0 or more times |
+ | matches the preceding element 1 or more times |
^regex | matches lines starting with the succeeding regex |
regex$ | matches lines ending with the preceding regex |
(r1|r2|...) | matches one of the provided regexes |
r{n,m} r{n,} r{,m} r{n} | matches if the preceding regex occurs at least $n$ times and at most $m$ times. Leave either bound empty if there is no minimum/maximum. The last form means "exactly $n$ times". |
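
The quantifiers and anchors in the table can be checked with a short Python sketch. The helper `full` is a hypothetical name introduced here; it wraps `re.fullmatch` so that each test string must match in its entirety. The `{,m}` form is accepted by Python's `re`, but not necessarily by every regex engine.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? makes the preceding group optional.
optional = [full(r"(a|b)?", s) for s in ["", "a", "b", "ab"]]

# * allows zero or more repetitions, + requires at least one.
star = [full(r"ab*", s) for s in ["a", "abbb", "b"]]
plus = [full(r"ab+", s) for s in ["a", "ab", "abbb"]]

# {n,m} bounds the repetition count; {,m} leaves the minimum empty.
bounded = [full(r"a{2,3}", s) for s in ["a", "aa", "aaa", "aaaa"]]
at_most_two = [full(r"a{,2}", s) for s in ["", "aa", "aaa"]]

# ^ and $ anchor to line boundaries when re.MULTILINE is set.
starting_with_de = re.findall(r"^de.*$", "abc\nde\ndef", flags=re.MULTILINE)

print(optional)          # [True, True, True, False]
print(star)              # [True, True, False]
print(plus)              # [False, True, True]
print(bounded)           # [False, True, True, False]
print(at_most_two)       # [True, True, False]
print(starting_with_de)  # ['de', 'def']
```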
Regex also provides shortcuts for common character sets, called metacharacters (often referred to as shorthand character classes):
- \d which is [0-9]
- \D which is [^0-9]
- \w which is [a-zA-Z0-9_]
- \W which is [^a-zA-Z0-9_]
- \s which is [ \t\n\r\f\v]
- \S which is [^ \t\n\r\f\v]
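
The shortcuts listed above can be tried out in Python, as a rough sketch. The sample string is invented for illustration. One assumption worth flagging: on `str` patterns Python's `\d`, `\w`, and `\s` also match non-ASCII Unicode characters unless `re.ASCII` is passed, so the bracket equivalences hold exactly only in ASCII mode.

```python
import re

sample = "id_42 = 3.14\tok"

# \d+ picks out runs of digits.
digits = re.findall(r"\d+", sample, flags=re.ASCII)

# \w+ picks out runs of letters, digits, and underscores.
words = re.findall(r"\w+", sample, flags=re.ASCII)

# \s+ matches runs of whitespace, so it can be used to split fields.
fields = re.split(r"\s+", sample, flags=re.ASCII)

# In ASCII mode the shortcut and the explicit set give the same result.
same_as_bracket = digits == re.findall(r"[0-9]+", sample)

print(digits)           # ['42', '3', '14']
print(words)            # ['id_42', '3', '14', 'ok']
print(fields)           # ['id_42', '=', '3.14', 'ok']
print(same_as_bracket)  # True
```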
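Regexes also introduced the notion of capture groups to separately extract a part of the match. In POSIX basic syntax (the default in sed), these groups are wrapped in \( and \); extended and Perl-style syntaxes use plain ( and ) instead.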
-^.{3}.*$
+^\(.{3}\).*$
The way to get the capture group back is implementation-specific. It may be $1 or \1, or it may be an array-like object, as in Python.
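
For the Python case, a minimal sketch looks like the following. The second group `(.*)` is added here purely for illustration and is not part of the pattern above. Note that Python uses plain parentheses for groups rather than \( and \).

```python
import re

# Group 1 captures the first three characters, group 2 the rest.
match = re.match(r"^(.{3})(.*)$", "feghi")

whole = match.group(0)        # group 0 is always the entire match
head = match.group(1)         # first capture group
tail = match.group(2)         # second capture group
all_groups = match.groups()   # every capture group, as a tuple

print(whole)       # feghi
print(head)        # feg
print(tail)        # hi
print(all_groups)  # ('feg', 'hi')
```

If the pattern does not match, `re.match` returns `None`, so real code should check the result before calling `group`.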
$ cat example
abc
de
feghi
# match lines with >=3 chars and replace the tail with "..."
$ sed 's/^\(.\{3\}\).*$/\1.../g' example
abc...
de
feg...
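
For comparison, the same substitution can be sketched in Python with `re.sub`. The list `lines` stands in for the contents of the example file and is an assumption made to keep the snippet self-contained.

```python
import re

# Stand-in for the lines of the "example" file.
lines = ["abc", "de", "feghi"]

# Keep the first three characters (group 1) and replace the tail with "...".
# Lines shorter than three characters do not match and are left unchanged.
truncated = [re.sub(r"^(.{3}).*$", r"\1...", line) for line in lines]

print("\n".join(truncated))
# abc...
# de
# feg...
```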
Practice 🧪