Python Regular Expression Tutorial
- Python Regular Expression re.match() Function
- Python Regular Expression re.search() Function
- Compile Regular Expressions With re.compile
- Flags in Python Regular Expression re Module
- Checking for Allowed Characters
- Search and Replace
- The findall() Function
- The finditer() Function
- The split() Function
- Basic Patterns of re
- Repetition Cases
- Nongreedy Repetition
- Special Characters and Sequences in re
- The escape() Function
- The group() Function
In this tutorial, you will learn about regular expressions and the regular expression operations defined in Python's re module. re is part of the Python standard library and provides regular expression matching operations.
A regular expression is a sequence of characters that defines a search pattern in a formal syntax; the pattern is then matched against strings. You can think of regular expressions as a small programming language embedded in Python.
You use a regular expression to define rules, and those rules determine which parts of a given string match the pattern. Python compiles a regular expression into a set of instructions that the matching engine executes.
Python Regular Expression re.match() Function
You can use the match function to match an RE pattern against the beginning of a given string. The match function also accepts flags, which change the behavior of the regular expression; their possible values are covered later in this tutorial.
The following is the syntax of match function in Python:
re.match(pattern, string, flags)
It takes three arguments:
- pattern is the regular expression pattern to be matched.
- string is the given string that is matched against the regular expression.
- flags is optional and changes the behavior of the regular expression.
If the match succeeds, a match object is returned; otherwise None is returned. A match object has two commonly used calls: group(num), which returns a specific subgroup of the match, and group(), which returns the entire match.
Use the re.match Function
The following example demonstrates how you can use the match function:
import re
strTest = "Hello Python Programming"
mobj = re.match(r"hello", strTest, re.I)
print(mobj.group())
This code first imports the re module. It then matches the string strTest against the RE pattern and assigns the value returned by the match function to mobj. In the call to re.match, the first argument is the pattern, the second is the string to match against, and the third is a flag value. Here re.I is the flag value that means IGNORECASE, so differences between upper case and lower case letters in the pattern and the string are ignored.
The output is:
Hello
In this example, the prefix r marks the pattern as a raw string. In a raw string, a backslash is kept exactly as written, so you write a single \ where a regular string would need a doubled \\. This is the only difference between a regular string and a raw string.
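As a minimal sketch of the point above, the following compares the two spellings of the same pattern. The variable names and the sample text "Room 101" are illustrative and not part of the original tutorial.

```python
import re

# Both spellings produce the same pattern text: a backslash, a "d" and a "+".
regular = "\\d+"   # regular string: the backslash has to be doubled
raw = r"\d+"       # raw string: the backslash is written once
same_pattern = regular == raw

# Either spelling finds the run of digits in the sample text.
digits = re.search(raw, "Room 101").group()
print(same_pattern, digits)
```

Raw strings are usually the easier choice for patterns, because what you type is what the regular expression engine receives.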
Use re.match Function With Regular String
Consider the example below in which a regular string is used instead of a raw string:
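import re
text = "\\tHello Python Programming"  # a backslash, the letter t, then the text
mobj = re.match("\\thello", text, re.I)  # no match: the pattern \thello expects a tab
text = "\tHello Python Programming"  # a real tab character, then the text
mobj = re.match("\\thello", text, re.I)  # match: \t in the pattern matches the tab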
Python Regular Expression re.search() Function
You can use the re.search() function to search for an RE pattern anywhere in the given string. The search function takes three arguments: the pattern, the given string, and flags (optional).
The following is the syntax of the search function in Python:
re.search(pattern, string, flags)
The following Python code demonstrates the use of the search() function:
import re
str = "Hello Python Programming"
sobj = re.search(r"programming", str, re.I)
print(sobj.group())
Programming
This code searches for the word programming. The search function scans the entire string. The difference between search and match is that match only checks at the beginning of the string, whereas search looks for the pattern anywhere in the string.
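The difference between the two functions can be sketched with one pattern applied both ways. This is only an illustration; the variable names are assumptions made for the example.

```python
import re

text = "Hello Python Programming"

# match only looks at the start of the string, so "python" is not found there.
at_start = re.match(r"python", text, re.I)

# search scans the whole string and finds the word in the middle.
anywhere = re.search(r"python", text, re.I)

# Guard against None before calling group() on a result.
found = anywhere.group() if anywhere else None
print(at_start, found)
```

Checking the result for None before calling group() avoids an AttributeError when nothing matches.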
Searching at the Beginning Using re.search
If you want to search only at the beginning of the string, you can use ^. Consider the following example:
import re
text = "Hello Python Programming"
sobj = re.search(r"^programming", text, re.I)
print(sobj)  # None: no match is found
sobj = re.search(r"^hello", text, re.I)
print(sobj.group())  # matching: Hello
Here ^ restricts the search to the beginning of the string.
Searching at the End by Using re.search
You can also search at the end of the given string by putting $ at the end of the pattern. Consider the code below:
import re
text = "Hello Python Programming"
sobj = re.search(r"programming$", text, re.I)
print(sobj.group())  # matching: Programming
sobj = re.search(r"hello$", text, re.I)
print(sobj)  # None: no match found
Compile Regular Expressions With re.compile
When a regular expression is compiled, it is converted into a pattern object. Pattern objects provide methods for tasks such as searching, matching, and replacing.
Once you have compiled a pattern, you can reuse it later in the program.
Use Precompiled Patterns
In the code below, the pattern r"(\d)", which matches a single digit, is compiled. The compiled pattern's search method is then called with a string, and it returns the first digit found in that string. Similarly, you can use the precompiled pattern with the match method as follows:
import re
compPat = re.compile(r"(\d)")
sobj = compPat.search("Lalalala 123")
print(sobj.group())
mobj = compPat.match("234Lalalala 123456789")
print(mobj.group())
1
2
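Reusing a compiled pattern pays off when the same pattern is applied to many strings. The following is a small sketch under that assumption; the list of sample lines and the variable names are made up for the example.

```python
import re

# Compile once, then reuse the pattern object for every line.
digit = re.compile(r"\d")

lines = ["Lalalala 123", "no digits here", "234Lalalala"]
first_digits = []
for line in lines:
    m = digit.search(line)
    # search returns None when a line has no digit at all.
    first_digits.append(m.group() if m else None)

print(first_digits)
```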
Flags in Python Regular Expression re Module
You can use flags to change the behavior of a regular expression. The flags argument is optional. You can pass flags in two ways: by using the keyword flags and assigning it a flag value, or by passing the flag value directly as a positional argument. You can combine more than one flag by using the bitwise OR operator |.
The following table describes some of the commonly used flags:
Flag Value | Description |
---|---|
re.I | Ignores the case of strings and patterns while matching. |
re.L | Interprets words with respect to the current locale. |
re.M | Makes $ match at the end of each line and not only at the end of the string. Similarly, ^ matches at the beginning of each line instead of only at the beginning of the string. |
re.S | Makes a dot . match any character, including a newline. |
re.U | Interprets characters according to the Unicode character set. |
re.X | Ignores whitespace in the pattern and treats # as the start of a comment. |
Use Multiple Flag Values
The following Python code shows how to use multiple flag values to change the behavior of an RE. Multiple flag values are combined with the bitwise OR (|) operator:
import re
s = re.search("L", "Hello")
print(s)  # Output: None, l is there but in lowercase and no flags are used
s = re.search("L", "Hello", re.I)
print(s)  # Output: <re.Match object; span=(2, 3), match='l'>
s = re.search("^L", "Hello\nlove", re.I | re.M)
print(s)  # Output: <re.Match object; span=(6, 7), match='l'>, ^ matches at the start of a line and case is ignored
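To make the effect of individual flags more concrete, here is a short sketch that uses the flags keyword with re.M and re.S on a two-line string. The sample text and variable names are assumptions chosen for the illustration.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ matches only at the very start of the string.
no_flag = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline.
multi_line = re.findall(r"^\w+", text, flags=re.M)

# By default the dot stops at a newline; with re.S it matches the newline too.
dot_default = re.match(r".*", text).group()
dot_all = re.match(r".*", text, flags=re.S).group()

print(no_flag, multi_line, dot_default, dot_all == text)
```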
Checking for Allowed Characters
You can also check whether a string contains only characters from a particular range.
Define a Function and Check Allowed Characters
The following example defines a function that uses a precompiled pattern to check whether the passed string contains only the allowed characters:
import re
def check(text):
    pattern = re.compile(r"[^A-Z]")
    disallowed = pattern.search(text)
    return not bool(disallowed)
print(check("HELLOPYTHON"))  # Output: True
print(check("hellopython"))  # Output: False
In this function, the pattern r'[^A-Z]' is compiled and used to search the string passed to check. The pattern matches any character that is not an uppercase letter A-Z, so the function returns True only when the string contains nothing but uppercase letters. When you pass a string in lowercase letters, False is returned.
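The same check can also be phrased positively. The sketch below is one possible alternative rather than the tutorial's own method: it uses fullmatch with the allowed set instead of searching for a disallowed character. The names UPPERCASE_ONLY and is_uppercase_only are hypothetical.

```python
import re

# Positive form: the whole string must consist of one or more letters A-Z.
UPPERCASE_ONLY = re.compile(r"[A-Z]+")

def is_uppercase_only(text):
    # fullmatch succeeds only if the pattern covers the entire string.
    return UPPERCASE_ONLY.fullmatch(text) is not None

print(is_uppercase_only("HELLOPYTHON"))
print(is_uppercase_only("hellopython"))
print(is_uppercase_only(""))
```

One difference from the negated search is the empty string: this version rejects it because + requires at least one character.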
Search and Replace
The re module provides the sub function, which replaces occurrences of pattern in the given string with repl. At most count occurrences are replaced; the default count of 0 replaces all of them. The sub function returns the updated string.
The following is the syntax of sub function:
re.sub(pattern, repl, string, count=0)
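As a brief illustration of the count argument, the sketch below replaces digits with a placeholder, first without a limit and then with count=2. The sample string and the "#" placeholder are assumptions made for the example.

```python
import re

s = "1 2 3 4"

# count defaults to 0, which replaces every occurrence.
all_replaced = re.sub(r"\d", "#", s)

# With count=2 only the first two occurrences are replaced.
first_two = re.sub(r"\d", "#", s, count=2)

print(all_replaced)
print(first_two)
```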
Use sub Function
In the example below, the sub function replaces the entire string with a given string:
import re
s = "Playing 4 hours a day"
obj = re.sub(r"^.*$", "Working", s)
print(obj)
Working
Here, the sub function is used. In the pattern r'^.*$', ^ anchors the match at the start of the string, .* matches whatever is in the string, and $ anchors the match at the end of the string. The argument "Working" then replaces the entire string s.
Use sub Function to Delete All the Digits From a String
In the following example, the sub function deletes the digits in the given string. For this purpose you can use \d:
import re
s = "768 Working 2343 789 five 234 656 hours 324 4646 a 345 day"
obj = re.sub(r"\d", "", s)
print(obj)
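 Working   five   hours   a  day
Only the digits are deleted; the spaces around them remain, so the result starts with a space and has several spaces between the words.
Similarly, you can delete every character that is not a digit. For this purpose you can use \D: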
import re
s = "768 Working 2343 789 five 234 656 hours 324 4646 a 345 day"
obj = re.sub(r"\D", "", s)
print(obj)
76823437892346563244646345
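Deleting the digits leaves the spaces that surrounded them in place. The following is a minimal sketch of one way to tidy the result with a second sub call; the variable names are illustrative.

```python
import re

s = "768 Working 2343 789 five 234 656 hours 324 4646 a 345 day"

# Step 1: delete each run of digits.
without_digits = re.sub(r"\d+", "", s)

# Step 2: collapse each run of whitespace into one space and trim the ends.
tidy = re.sub(r"\s+", " ", without_digits).strip()

print(tidy)
```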
The findall() Function
The findall function returns a list of all the strings that match the pattern. The difference between search and findall is that findall finds all the matches, whereas search finds only the first match. findall finds the non-overlapping matches and returns them as a list of strings.
The following is the syntax of the findall function:
re.findall(pattern, string, flags)
Here pattern is the RE pattern to find in the given string, and flags holds optional flag values, for example re.I to ignore case.
Find All Non-Overlapping Matches
In the following example, findall finds non-overlapping matches:
import re
str = "Working 6 hours a day. Studying 4 hours a day."
mobj = re.findall(r"[0-9]", str)
print(mobj)
['6', '4']
r'[0-9]' is a pattern that finds all the digits in the given string. A list of strings is returned (even though the matches are digits) and stored in mobj.
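To contrast search and findall side by side, here is a small sketch on a string that contains a two-digit number. The sample text and names are assumptions for the illustration.

```python
import re

text = "Working 6 hours a day. Studying 14 hours a week."

# search stops at the first match.
first = re.search(r"[0-9]+", text).group()

# findall returns every non-overlapping match, always as strings.
every = re.findall(r"[0-9]+", text)

# Without the +, each digit is a separate match, so 14 is split in two.
single_digits = re.findall(r"[0-9]", text)

print(first, every, single_digits)
```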
findall With Files
You can also use findall to search the contents of a file. findall then returns a list of all the matching strings in the file. Because the file's read() method returns the entire text of the file as one string, you do not have to iterate through each line of the file with a loop. Consider the following example:
import re
# The with statement closes the file automatically.
with open("asd.txt", "r") as file:
    mobj = re.findall(r"arg.", file.read())
print(mobj)
['arg,', 'arg,', 'arg,', 'argv', 'argv', 'argv']
In this example, the file is first opened in read mode. The pattern r'arg.' is matched against the content of the file, and the output is the list of matching strings.
The finditer() Function
The finditer function finds the RE pattern in a string and also reports where each match is located, that is, its index in the string. It returns an iterator that yields a match object for each match, and each match object gives the location of the matching string.
The following is the syntax of the finditer function:
re.finditer(pattern, string, flags)
Iterate Over Matches
The difference between findall and finditer is that findall returns a list of the matching strings, while finditer yields match objects that provide the index as well as the matching string. In the code below, finditer is used to find the locations of the matching strings while iterating over the matches with a for loop.
import re
text = "Working 6 hours a day. Studying 4 hours a day."
pat = r"[0-9]"
for mobj in re.finditer(pat, text):
    s = mobj.start()
    e = mobj.end()
    g = mobj.group()
    print("{} found at location [{},{}]".format(g, s, e))
6 found at location [8,9]
4 found at location [32,33]
In this example, the pattern looks for the digits from 0 to 9 in the given string. The for loop iterates over the match objects returned by finditer. In each iteration, the methods start, end and group return the start index, the end index and the matched text of the current match, respectively.
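Match objects also offer span(), which returns the start and end index together. The sketch below collects each match with its span and checks that slicing the string with those indexes gives the match back; the variable names are illustrative.

```python
import re

text = "Working 6 hours a day. Studying 4 hours a day."

# Collect (matched text, (start, end)) for every digit.
spans = [(m.group(), m.span()) for m in re.finditer(r"[0-9]", text)]

# The end index is exclusive, so text[start:end] is the matched text.
slices = [text[start:end] for _, (start, end) in spans]

print(spans)
print(slices)
```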
The split() Function
The split function splits a string at each occurrence of a pattern.
The following is the syntax of the split function:
re.split(pattern, string, maxsplit, flags)
Here maxsplit is the maximum number of splits. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. The default value of maxsplit is 0, which means unlimited splits.
Split a String
The split function returns a list of the pieces of a string.
In the code below, a string is split according to the given pattern and maximum number of splits.
import re
text = "Birds fly high in the sky for ever"
mobj = re.split(r"\s+", text, maxsplit=5)
print(mobj)
['Birds', 'fly', 'high', 'in', 'the', 'sky for ever']
In this example, \s is a special sequence that matches a whitespace character; it is equivalent to [ \t\n\r\f\v]. The string is therefore separated into words. The value of maxsplit is 5 here, so 5 splits are made and the list has 6 elements; the last element is the remainder of the string after the 5th split.
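The effect of maxsplit is easiest to see by splitting the same string with and without a limit. This is only a sketch; the sample string, which mixes double spaces and a tab, is an assumption for the example.

```python
import re

text = "Birds  fly\thigh in the sky"

# No limit: \s+ treats each run of spaces or tabs as one separator.
all_words = re.split(r"\s+", text)

# maxsplit=1: one split only, the rest of the string stays in one piece.
two_parts = re.split(r"\s+", text, maxsplit=1)

print(all_words)
print(two_parts)
```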
Basic Patterns of re
Regular expressions specify patterns that are compared to given strings. The following are the basic patterns of regular expressions:
Pattern | Description |
---|---|
^ | Matches at the beginning of the string. |
$ | Matches at the end of the string. |
. | Matches any single character except a newline. |
[...] | Matches any single character listed within the brackets. |
[^...] | Matches any single character that is not listed within the brackets. |
* | 0 or more occurrences of the preceding RE. |
+ | 1 or more occurrences of the preceding RE. |
? | 0 or 1 occurrences of the preceding RE. |
{n} | Exactly n occurrences of the preceding RE. |
{n,} | n or more occurrences of the preceding RE. |
{n,m} | At least n and at most m occurrences of the preceding RE. |
`a \| b` | Matches either a or b. |
(re) | Groups regular expressions and remembers the matched text. |
(?imx) | Turns on the i, m, or x flag in the RE. In Python, this form applies to the whole pattern and belongs at its start. |
(?-imx) | Turns off the i, m, or x flag in regex flavors that support this form. Python's re accepts flag removal only in the scoped form (?-imx: re). |
(?: re) | Groups regular expressions without remembering the matched text. |
(?imx: re) | Turns on the i, m, or x flag only inside the parentheses. |
(?-imx: re) | Turns off the i, m, or x flag only inside the parentheses. |
(?#...) | A comment; its content is ignored. |
(?= re) | Positive lookahead: matches at a position where re matches next, without consuming any characters. |
(?! re) | Negative lookahead: matches at a position where re does not match next, without consuming any characters. |
(?> re) | Atomic group: matches re as an independent pattern without backtracking into it (Python 3.11 and later). |
\w | Matches word characters (letters, digits, and the underscore). |
\W | Matches non-word characters. |
\s | Matches whitespace. \s is equal to [ \t\n\r\f\v] . |
\S | Matches non-whitespace characters. |
\d | Matches digits; for ASCII text it is equal to [0-9] . |
\D | Matches non-digits. |
\A | Matches at the beginning of the string. |
\Z | Matches only at the end of the string. |
\G | Matches at the point where the last match finished in regex flavors that support it; Python's re module does not support \G. |
\b | Matches word boundaries outside brackets; inside brackets it matches a backspace. |
\B | Matches non-word boundaries. |
\n, \t, etc. | \n matches a newline, \t matches a tab, and so on. |
\1...\9 | Matches the text of the nth group. |
\10 | Matches the text of the 10th group if the pattern has that many groups. Python's re reads an escape as an octal character code only when it starts with 0 or is three octal digits long. |
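The two lookahead rows in the table can be illustrated with a short sketch. The sample text and the USD suffix are assumptions for the example; the extra \d inside the negative lookahead keeps the engine from settling for a shorter part of a number.

```python
import re

text = "price: 250USD, weight: 100kg"

# Positive lookahead: digits that are followed by USD (USD is not consumed).
followed_by_usd = re.findall(r"\d+(?=USD)", text)

# Negative lookahead: complete numbers that are not followed by USD.
not_followed_by_usd = re.findall(r"\d+(?!\d|USD)", text)

print(followed_by_usd, not_followed_by_usd)
```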
Repetition Cases
The following table shows some examples of repetition with descriptions:
Examples | Descriptions |
---|---|
ab? | Matches either a or ab. |
ab* | Matches a followed by zero or more b's, for example a, ab, or abbb. |
ab+ | Matches a followed by one or more b's; a alone does not match. |
\d{2} | Matches exactly 2 digits. |
\d{2,} | Matches 2 or more digits. |
\d{2,4} | Matches 2, 3, or 4 digits. |
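The repetition cases can be checked with fullmatch, which succeeds only when the pattern covers the whole string. The helper name matches and the test strings below are hypothetical, chosen just for this sketch.

```python
import re

def matches(pattern, text):
    # True only if the pattern matches the entire text.
    return re.fullmatch(pattern, text) is not None

checks = [
    (r"ab?", "a"),         # b is optional
    (r"ab?", "abb"),       # at most one b is allowed
    (r"ab*", "abbb"),      # any number of b's
    (r"ab+", "a"),         # at least one b is required
    (r"\d{2}", "42"),      # exactly two digits
    (r"\d{2,4}", "12345"), # five digits is too many
    (r"\d{2,}", "12345"),  # two or more digits
]
results = [matches(pattern, text) for pattern, text in checks]
print(results)
```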
Nongreedy Repetition
In regular expressions, repetition is greedy by default: it tries to match as many repetitions as possible.
Quantifiers such as *, + and ? are greedy. When you use .*, it performs a greedy match and matches as many characters as possible, here the entire string. Consider the code below:
import re
mobj = re.match(r".*", "Birds fly high in sky")
print(mobj.group())
Birds fly high in sky
So you can see here the entire string is matched.
When you add ? after .+, you get a non-greedy RE, and the pattern .+? matches as few characters as possible in the string.
import re
mobj = re.match(r".+?", "Birds fly high in sky")
print(mobj.group())
The result is the first character of the string:
B
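A common place where greedy and non-greedy repetition differ visibly is text with repeated delimiters. The sketch below uses a made-up string with angle-bracket tags; the names are illustrative.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last > in the string.
greedy = re.search(r"<.+>", html).group()

# Non-greedy: .+? stops at the first > it can.
lazy = re.search(r"<.+?>", html).group()

print(greedy)
print(lazy)
```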
Special Characters and Sequences in re
Special sequences in re start with a \. For example, \A matches at the beginning of the string.
These special sequences are described in the table above.
This section demonstrates some of them:
import re
text = "Birds fly high in the sky"
# \A
mobj = re.match(r"\Ab", text, re.I)
print(mobj.group())  # OUTPUT: B, here \A matches at the beginning only
# \d
mobj = re.match(r"\d", "4 birds are flying")
print(mobj.group())  # OUTPUT: 4
# \s
parts = re.split(r"\s+", "birds fly high in the sky", maxsplit=1)
print(parts)  # OUTPUT: ['birds', 'fly high in the sky']
The escape() Function
The escape function escapes the special characters in a string by putting a backslash in front of them. ASCII letters, numbers, and _ are not escaped. The escape function is used when you want to match a string literally even though it contains metacharacters.
The following is the syntax of the escape function:
re.escape(pattern)
In the following example, the string www.python.org is passed to the escape function. It contains ., which is a metacharacter, so it is escaped:
import re
print(re.escape("www.python.org"))
www\.python\.org
Here . is a metacharacter, so it is escaped. Whenever the escape function escapes a metacharacter, you get a \ before the character.
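To show why escaping matters, the sketch below searches for the same literal text with and without re.escape. The sample sentence containing wwwXpythonYorg is made up for the example.

```python
import re

literal = "www.python.org"
text = "Visit wwwXpythonYorg or www.python.org"

# Unescaped, each dot matches any character, so the wrong text matches too.
unescaped = re.findall(literal, text)

# Escaped, each dot matches only a real dot.
escaped = re.findall(re.escape(literal), text)

print(unescaped)
print(escaped)
```

This is the typical use of re.escape: building a pattern from text that should be taken literally, such as user input.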
Escape Special Characters
Characters like the brackets [ and ] are not matched literally, because they have a special meaning in a pattern. Consider the following example:
import re
mobj = re.search(r"[a]", "[a]b")
print(mobj.group())
a
Here you can see that the brackets [ and ] are not matched; [a] is read as a character set containing a.
You can match them by escaping them with a backslash:
import re
mobj = re.search(r"\[a\]", "[a]b")
print(mobj.group())
[a]
The group() Function
The group function returns one or more subgroups of the found match. It can take several arguments.
The following is the syntax of the group function:
group(group1, group2, ..., groupN)
If you pass a single argument to group, the result is a single string; if you pass more than one argument, the result is a tuple containing one item per argument.
When there is no argument, the argument defaults to zero and the entire match is returned.
When an argument groupN is zero, the corresponding return value is the entire matching string.
If you specify a group number that is negative or larger than the number of groups in the pattern, an IndexError exception is raised.
The code below calls group with no argument, which is equivalent to group(0).
import re
str = "Working 6 hours a day"
mobj = re.match(r"^.*", str)
print(mobj.group())
Working 6 hours a day
Here group() is used and you have the entire matched string.
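The rules for several arguments and for out-of-range group numbers can be sketched as follows. The two-word pattern and the variable names are assumptions made for the illustration.

```python
import re

m = re.match(r"(\w+) (\w+)", "Hello Python")

whole = m.group()      # no argument: the entire match
pair = m.group(1, 2)   # several arguments: a tuple, one item per argument

# The pattern has only two groups, so group 3 does not exist.
try:
    m.group(3)
    raised = False
except IndexError:
    raised = True

print(whole, pair, raised)
```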
Picking Parts of Matching Texts
In the following example, the group function is used with arguments to pick up matching groups:
import re
a = re.compile("(p(q)r)s")
b = a.match("pqrs")
print(b.group(0))
print(b.group(1))
print(b.group(2))
pqrs
pqr
q
Here group(0) returns the entire match. group(1) returns the first group, which is pqr, and group(2) returns the second group, which is q.
Named Groups
With named groups, you can give a capturing group a name and then refer to the group by that name. Consider the example below:
import re
mobj = re.search(r"Hi (?P<name>\w+)", "Hi Roger")
print(mobj.group("name"))
Roger
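Named groups remain numbered as well, and a match object can return all of them at once with groupdict(). The sketch below extends the example with a second, hypothetical group called greeting.

```python
import re

m = re.search(r"(?P<greeting>Hi|Hello) (?P<name>\w+)", "Hi Roger")

by_name = m.group("name")   # look up the group by its name
by_number = m.group(2)      # the same group, looked up by position
as_dict = m.groupdict()     # all named groups as a dictionary

print(by_name, by_number, as_dict)
```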
Non-Capturing Groups
A non-capturing group is created with ?:. Use a non-capturing group when you need grouping but do not want to capture the content of the group.
import re
mobj = re.match("(?:[pqr])+", "pqr")
print(mobj.groups())
()
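For comparison, the sketch below runs the same pattern with a capturing and a non-capturing group. Both match the same text; only the captured groups differ. The variable names are illustrative.

```python
import re

capturing = re.match(r"([pqr])+", "pqr")
non_capturing = re.match(r"(?:[pqr])+", "pqr")

# Both patterns match the whole string.
same_match = capturing.group() == non_capturing.group() == "pqr"

# A repeated capturing group keeps only its last repetition.
captured = capturing.groups()
# A non-capturing group stores nothing.
not_captured = non_capturing.groups()

print(same_match, captured, not_captured)
```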
Founder of DelftStack.com. Jinku has worked in the robotics and automotive industries for over 8 years. He sharpened his coding skills when he needed to do the automatic testing, data collection from remote servers and report creation from the endurance test. He is from an electrical/electronics engineering background but has expanded his interest to embedded electronics, embedded programming and front-/back-end programming.