Regular expressions (regex)
Regular expressions (expressions régulières), commonly referred to as regex, are an enhanced version of glob-patterns and are present in most, if not all, languages.
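🎯 Everything learned in glob-patterns is still available in regexes, so it won't be covered.
⚠️ The symbol for "one character" is now . (dot) and not ? (question mark), which was given a new meaning. Likewise, * now repeats the preceding element zero or more times, so the glob * is written .* in a regex.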
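As a rough illustration of the change in meaning, the sketch below compares a glob pattern with a regex, using Python's standard `fnmatch` and `re` modules. The sample words are made up for the example.

```python
import fnmatch
import re

# Glob: "?" stands for exactly one character.
glob_hit = fnmatch.fnmatchcase("cat", "c?t")

# Regex: "." stands for exactly one character.
regex_hit = re.fullmatch(r"c.t", "cat") is not None

# Regex: "?" makes the preceding character optional,
# so "c?t" matches "t" or "ct", but not "cat".
optional_hits = [w for w in ["t", "ct", "cat"] if re.fullmatch(r"c?t", w)]

print(glob_hit, regex_hit, optional_hits)  # True True ['t', 'ct']
```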
Enhancements
Some new symbols were introduced:
Symbol | Description |
---|---|
x? | an optional character 'x' |
x+ | one or more occurrences of x |
^x | lines starting with x |
x$ | lines ending with x |
(x|y) | either x or y |
x{n,m} x{n,} x{,m} x{n} | x repeated at least $n$ and at most $m$ times; leave either bound empty if there is no limit. The last form means "exactly $n$ times". |
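The symbols in the table can be tried out with a short script. This is a minimal sketch using Python's `re` module; the helper `matches` and the sample words are hypothetical and only serve the illustration.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["color", "colour", "colouur"]
optional = matches(r"colou?r", words)       # ['color', 'colour']
one_or_more = matches(r"colou+r", words)    # ['colour', 'colouur']
exactly_two = matches(r"colou{2}r", words)  # ['colouur']
zero_to_one = matches(r"colou{0,1}r", words)  # ['color', 'colour']
either = matches(r"(cat|dog)s?", ["cat", "dogs", "cow"])  # ['cat', 'dogs']

# ^ and $ anchor to line boundaries when re.MULTILINE is set.
text = "apple\nbanana\navocado"
starting_with_a = re.findall(r"^a.*$", text, flags=re.MULTILINE)  # ['apple', 'avocado']
ending_with_o = re.findall(r"^.*o$", text, flags=re.MULTILINE)    # ['avocado']

print(optional, one_or_more, exactly_two, zero_to_one, either)
print(starting_with_a, ending_with_o)
```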
Regex also introduced metacharacters, which are shorthands for these charsets:
- \d, which is [0-9]
- \D, which is [^0-9]
- \w, which is [a-zA-Z0-9_]
- \W, which is [^a-zA-Z0-9_]
- \s, which is [ \t\n\r\f\v] (the space character is included)
- \S, which is [^ \t\n\r\f\v]
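One way to check these equivalences is to compare what each shorthand and its explicit charset find in the same text. The sketch below assumes Python's `re` module with the `re.ASCII` flag, since by default Python lets \d, \w and \s match Unicode characters as well. The sample string is made up, and \s is taken to include the space character.

```python
import re

sample = "Room 42,\tfloor_3 !"

pairs = [
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\w", r"[a-zA-Z0-9_]"),
    (r"\W", r"[^a-zA-Z0-9_]"),
    (r"\s", r"[ \t\n\r\f\v]"),
    (r"\S", r"[^ \t\n\r\f\v]"),
]

results = {}
for shorthand, charset in pairs:
    # re.ASCII restricts the shorthands to their ASCII character sets.
    found_short = re.findall(shorthand, sample, flags=re.ASCII)
    found_full = re.findall(charset, sample, flags=re.ASCII)
    results[shorthand] = found_short == found_full
    print(shorthand, charset, results[shorthand])
```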
You can apply a symbol to a whole group by wrapping the group in parentheses. For instance, (ab)+(cd|e)? matches one or more repetitions of ab, optionally followed by cd or e.
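To see which strings the grouped pattern accepts, here is a small sketch with Python's `re` module. The candidate strings are hypothetical test inputs.

```python
import re

pattern = re.compile(r"(ab)+(cd|e)?")
candidates = ["ab", "abab", "abcd", "ababe", "a", "cd", "abcde"]

# fullmatch requires the whole string to fit the pattern.
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)  # ['ab', 'abab', 'abcd', 'ababe']
```

"abcde" is rejected because the optional group can contribute cd or e, not both.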
Many languages and commands allow us to use capture groups 🚩. These are groups wrapped in ( and ), or in \( and \) in POSIX basic regular expressions (the default syntax of sed and grep). They allow us to extract parts of the matched text, which is useful when a regex is used to extract data.
For instance, this regex matches a line with at least 3 characters (.{3}), followed by a possibly empty string (.*). To know what the three characters are, we can wrap them in a capture group.
^(.{3}).*$
The way to get a capture group back differs between tools. It may be $1 or \1, or an element of an array returned by a function.
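As one example of the "returned by a function" case, Python's `re` module exposes capture groups through the match object. This is a minimal sketch; note that Python writes groups with plain parentheses, and the input strings are made up.

```python
import re

pattern = re.compile(r"^(.{3}).*$")

match = pattern.match("feghi")
whole = match.group(0)      # the whole match: 'feghi'
first_three = match.group(1)  # the first capture group: 'feg'
all_groups = match.groups()   # every group as a tuple: ('feg',)

# A line shorter than 3 characters does not match at all.
too_short = pattern.match("de")

print(whole, first_three, all_groups, too_short)
```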
$ echo -e "abc\nde\nfeghi\nklmnop" > example
$ cat example
abc
de
feghi
klmnop
# replace the match with the first 3 characters followed by "..."
# (sed uses basic regex by default, so groups and braces are escaped)
$ sed 's/^\(.\{3\}\).*$/\1.../g' example
abc...
de
feg...
klm...
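For comparison, the same substitution can be sketched in Python, where the capture group is referenced as \1 in the replacement string. The list `lines` simply mirrors the content of the example file above.

```python
import re

lines = ["abc", "de", "feghi", "klmnop"]
pattern = re.compile(r"^(.{3}).*$")

# Lines that do not match (fewer than 3 characters) are left unchanged.
shortened = [pattern.sub(r"\1...", line) for line in lines]
print("\n".join(shortened))
```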