Python Regular Expression to Match a Multiline Block of Text
This article discusses ways to search for a specific pattern in multiline strings. The solution comprises several approaches for known and unknown patterns and explains how the matching patterns work.
Reason to Write Regex to Match Multiline String
Suppose that we have the following block of text:
Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.\n
\n
IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.
From the text block above, we need to find the opening line and the text that appears a few lines below it. Note that \n represents a newline and is not literal text.
To sum it up, we want to find and match text across multiple lines, ignoring any empty lines that come between them. For the text above, a single regular expression query should return both the "Any compiled body..." line and the "IBM first used the term..." line.
Possible Solutions to Match Multiline String
Before discussing the solutions to this particular problem, it is essential to understand the different aspects of the regex (regular expression) API, particularly those used frequently throughout the solution.
So, let’s start with re.compile().
Python re.compile() Method
re.compile() compiles a regex pattern into a regular expression object that we can use for matching with match(), search(), and other methods.
One advantage of re.compile() over uncompiled patterns is reusability. We can use the compiled expression multiple times instead of declaring a new string for each uncompiled pattern.
import re as regex
pattern = regex.compile(".+World")
print(pattern.match("Hello World!"))
print(pattern.search("Hello World!"))
Output:
<re.Match object; span=(0, 11), match='Hello World'>
<re.Match object; span=(0, 11), match='Hello World'>
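In the example above, match() and search() return the same result because the pattern matches from the first character of the string. As a minimal sketch of where the two differ, and of reusing one compiled object on several inputs (the sample strings are assumptions made for illustration):

```python
import re

# Compile once; the same pattern object is reused for every input below.
pattern = re.compile(r"World")

samples = ["Hello World!", "World peace", "Hello there"]  # illustrative inputs

# match() only succeeds if the pattern matches at the start of the string.
match_hits = [pattern.match(s) is not None for s in samples]

# search() scans the whole string for the first location that matches.
search_hits = [pattern.search(s) is not None for s in samples]

print(match_hits)
print(search_hits)
```

Only "World peace" satisfies match(), while search() also finds "World" in the middle of "Hello World!".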
Python re.search() Method
re.search() searches a string for a match and returns a Match object if one is found. If several matches exist, it returns only the first one.
We can also call it directly without re.compile(), which is convenient when only one query needs to be made.
import re as regex
print(regex.search(".+World", "Hello World!"))
Output:
<re.Match object; span=(0, 11), match='Hello World'>
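Because re.search() returns None when nothing matches, it is usually worth checking the result before using it. A short sketch, assuming an illustrative input string that contains the pattern twice:

```python
import re

text = "one World, two World"  # illustrative input

# Only the first occurrence is returned, even though two exist.
first = re.search(r"\w+ World", text)
if first is not None:
    print(first.group(), first.span())

# A pattern that does not occur anywhere yields None instead of a Match.
missing = re.search(r"Mars", text)
print(missing)
```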
Python re.finditer() Method
re.finditer() matches a pattern within a string and returns an iterator that yields Match objects for all non-overlapping matches.
We can then use the iterator to iterate over the matches and perform the necessary operations; the matches are ordered in the way they are found, from left to right in the string.
import re as regex
matches = regex.finditer(r"[aeoui]", "vowel letters")
for match in matches:
    print(match)
Output:
<re.Match object; span=(1, 2), match='o'>
<re.Match object; span=(3, 4), match='e'>
<re.Match object; span=(7, 8), match='e'>
<re.Match object; span=(10, 11), match='e'>
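Each Match object also exposes the matched text and its position, so the iterator can be turned into plain data when the full object is not needed. A small sketch on the same input string:

```python
import re

# Collect (start index, matched text) pairs instead of printing Match objects.
vowels = [(m.start(), m.group()) for m in re.finditer(r"[aeiou]", "vowel letters")]
print(vowels)
```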
Python re.findall() Method
re.findall() returns a list of all non-overlapping matches of a pattern in a string. Each item is a string, or a tuple of strings when the pattern contains more than one capturing group. The string is scanned from left to right, and the matches are returned in the order in which they were found.
import re as regex
# Find all runs of capital letters
string = ",,21312414.ABCDEFGw#########"
print(regex.findall(r"[A-Z]+", string))
Output:
['ABCDEFG']
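The shape of the list returned by re.findall() depends on how many capturing groups the pattern has, which matters for the multi-group patterns used later. A quick sketch with an illustrative input string:

```python
import re

text = "x=1, y=22"  # illustrative input

# No capturing group: each item is the whole match.
no_group = re.findall(r"\w=\d+", text)

# One capturing group: each item is that group's text.
one_group = re.findall(r"\w=(\d+)", text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w)=(\d+)", text)

print(no_group)
print(one_group)
print(two_groups)
```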
Python re.MULTILINE Flag
A significant advantage of the re.MULTILINE flag is that it allows ^ to match at the beginning of every line instead of just at the beginning of the string. Likewise, $ matches at the end of every line.
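To make the effect of the flag concrete, here is a minimal comparison on an illustrative three-line string, with and without re.MULTILINE:

```python
import re

text = "first line\nsecond line\nthird line"  # illustrative input

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r"^\w+", text)

# With the flag, ^ also matches right after each newline.
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

# With the flag, $ matches just before each newline and at the end of the string.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(without_flag)
print(with_flag)
print(line_ends)
```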
Python Regex Symbols
Regex symbols can quickly become confusing when combined in complex patterns. Below are the symbols used in our solutions, with a short description of each.
^ asserts position at the start of a line.
String matches the characters "String" literally (case sensitive).
. matches any character (except line terminators).
+ matches the previous token one or more times, as many times as possible.
\n matches a newline character.
\r matches a carriage return (CR) character.
? matches the previous token between 0 and 1 times.
+? matches the previous token between 1 and infinite times, as few times as possible.
a-z matches a single character in the range between a and z (case sensitive).
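The difference between the greedy + and the lazy +? is easiest to see side by side. The following sketch uses small illustrative strings that are not part of the original examples:

```python
import re

text = "<a><b>"  # illustrative input

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? grabs as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group()

# ? makes the previous token optional (0 or 1 times).
optional = re.findall(r"colou?r", "color colour")

# . does not match a newline by default, so .+ stops at the end of the line.
one_line = re.search(r".+", "line one\nline two").group()

print(greedy, lazy, optional, one_line)
```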
Use re.compile() to Match a Multiline Block of Text in Python
Let’s walk through a few different patterns.
Pattern 1: Use re.search() for Known Pattern
Example Code:
import re as regex
multiline_string = "Regular\nExpression"
print(regex.search(r"^Expression", multiline_string, regex.MULTILINE))
Output:
<re.Match object; span=(8, 18), match='Expression'>
The above expression first asserts its position at the start of a line (due to ^) and then searches for the exact text "Expression".
Using the MULTILINE flag lets ^ match at the start of every line, so each line is checked for "Expression" instead of just the first line.
Pattern 2: Use re.search() for Unknown Pattern
Example Code:
import re as regex
data = """Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.\n
\n
IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.
"""
result = regex.compile(r"^(.+)(?:\n|\r\n)+((?:(?:\n|\r\n?).+)+)", regex.MULTILINE)
print(result.search(data)[0].replace("\n", ""))
Output:
Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.
The regular expression can be broken down into smaller chunks for better readability:
In the first capturing group (.+), every character in the line is matched (except line terminators), as many times as possible.
Next, the non-capturing group (?:\n|\r\n)+ matches a line feed, or a carriage return followed by a line feed, one or more times. This consumes the line break after the first line and the empty lines that follow it.
The second capturing group ((?:(?:\n|\r\n?).+)+) wraps a non-capturing group that is repeated one or more times. Each repetition first matches a line break through (?:\n|\r\n?), which is a line feed, or a carriage return optionally followed by a line feed.
After the line break, .+ matches every remaining character on that line, excluding line terminators, as many times as possible.
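As a sketch of how the two capturing groups can be read separately, the same pattern is applied below to a shorter illustrative text (the sample strings are assumptions, not part of the original data). It also shows a consequence of the structure described above: the pattern needs at least one empty line between the first line and the rest, because one line break is consumed by the middle group and another by the second capturing group.

```python
import re

block = re.compile(r"^(.+)(?:\n|\r\n)+((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

text = "First paragraph.\n\n\nSecond paragraph.\nStill second.\n"  # illustrative

m = block.search(text)
head = m.group(1)                      # the opening line
tail = m.group(2).strip().splitlines() # the lines after the empty lines

print(head)
print(tail)

# With only a single line break between the lines, the pattern does not match.
no_gap = block.search("A\nB")
print(no_gap)
```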
Pattern 3: Use re.finditer() for Unknown Pattern
Example Code:
import re as regex
data = """Regex In Python

Regex is a feature available in all programming languages used to find patterns in text or data.
"""
query = regex.compile(r"^(.+?)\n([\a-z]+)", regex.MULTILINE)
for match in query.finditer(data):
    topic, content = match.groups()
    print("Topic:", topic)
    print("Content:", content)
Output:
Topic: Regex In Python
Content:
Regex is a feature available in all programming languages used to find patterns in text or data.
The above expression can be explained as follows:
In the first capturing group (.+?), all characters are matched (except for line terminators, as before), as few as possible. After that, a single newline character \n is matched.
The second capturing group ([\a-z]+) then matches the rest of the text. Inside a character class, \a is the bell character (0x07), so [\a-z] is the range from 0x07 to z. That range includes newlines, spaces, digits, uppercase letters, and most punctuation in addition to a-z, which is why the group runs across line breaks, and it is repeated as many times as possible.
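The character class [\a-z] is easy to misread, so the sketch below first checks which sample characters it covers, and then shows a more explicit pattern for the same "heading, then body" layout. The alternative pattern and the sample text are assumptions made for illustration, not the pattern used in the example above.

```python
import re

# "\a" is the bell character (0x07), so [\a-z] spans 0x07 through "z" (0x7A).
wide_class = re.compile(r"[\a-z]")
samples = ["\n", " ", "A", "5", ".", "z", "{", "~"]
covered = [ch for ch in samples if wide_class.fullmatch(ch)]
print(covered)

# Hypothetical, more explicit alternative: a heading line, one or more
# line breaks, then one or more non-empty body lines.
explicit = re.compile(r"^(.+)\n+((?:.+\n?)+)", re.MULTILINE)

text = "Title\n\nbody line one\nbody line two\n"  # illustrative input
m = explicit.search(text)
topic, content = m.group(1), m.group(2).strip()
print(topic)
print(content)
```

The explicit version states which parts may span lines, at the cost of a slightly longer pattern.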
Use re.findall() to Match a Multiline Block of Text in Python
Example Code:
import re as regex
data = """When working with regular expressions, the sub() function of the re library is an invaluable tool.
the subroutine looks over the string for the given pattern and applies the given replacement to all instances where it is found.
"""
query = regex.findall("([^\n\r]+)[\n\r]([a-z \n\r]+)", data)
for results in query:
    for result in results:
        print(result.replace("\n", ""))
Output:
When working with regular expressions, the sub() function of the re library is an invaluable tool.
the subroutine looks over the string for the given pattern and applies the given replacement to all instances where it is found
To better understand the regular expression, let’s break it down by group and see what each part does.
In the first capturing group ([^\n\r]+), all characters except a newline or a carriage return are matched, as often as possible.
After that, [\n\r] matches a single character that is either a carriage return or a newline.
In the second capturing group ([a-z \n\r]+), lowercase letters a-z, spaces, newlines, and carriage returns are matched as many times as possible.
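Since the second group only accepts lowercase letters, spaces, and line breaks, it stops at the first character outside that set, which is why the final period is missing from the output above. A small sketch of that behavior, using an illustrative string with a Windows-style \r\n line ending:

```python
import re

# Illustrative input: a heading, a \r\n line break, then two sentences.
text = "Heading One\r\nlower case body. Next sentence"

pairs = re.findall("([^\n\r]+)[\n\r]([a-z \n\r]+)", text)
print(pairs)
```

Here [\n\r] consumes the \r, the second group picks up the \n and the lowercase text, and matching ends at the period, so "Next sentence" is not captured.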
Hello! I am Salman Bin Mehmood(Baum), a software developer and I help organizations, address complex problems. My expertise lies within back-end, data science and machine learning. I am a lifelong learner, currently working on metaverse, and enrolled in a course building an AI application with python. I love solving problems and developing bug-free software for people. I write content related to python and hot Technologies.