How to Find the Longest Common Substring in C++
In this article, we will discuss the longest common substring problem and its solution using dynamic programming in C++.
Longest Common Substring
The longest common substring (LCS) is a well-known problem in computer science. A maximum length substring in two or more strings is the longest common substring.
For example, consider the following three strings:
XYXYZ
YXYZX
XYZYX
The longest common substring of these three strings is XYZ
. Other common substrings exist in these three strings like X
, XY
, Y
, YX
, YZ
, and Z
; however, the longest common substring is XYZ
only.
Figure 1 shows the longest common substring of the above three strings:
Problem Formulation of the Longest Common Substring
Consider two strings, S1
and S2
, with lengths m
and n, respectively, to find the exact substring from both S1
and S2
, with the maximum length.
We will present two solutions to this problem:
- A Simple Solution
- Dynamic Programming Solution
A Simple Solution
Following are the steps to find the longest common substring from two strings using the simple solution:
-
Find all substrings of the first string
S1
. -
For every substring of
S1
, check whether it is a substring ofS2
or not. -
Note that the maximum length substring matches the
S2
substring. -
As the length of the string
S1
ism
, there are $\mathcal{O}(m^2)$ substrings. -
The length of string
S2
isn
, and $\mathcal{O}(n)$ is required to find whether a string is a substring or not. -
The total time complexity required to perform this solution is $\mathcal{O}(nm^2)$.
This solution can find the longest common substring; however, it will take time, so let’s look at the dynamic programming solution to achieve the same goal in less time.
Find the Longest Common Substring Through Dynamic Programming
We can use dynamic programming (DP) to find the longest common substring (LCS) from two strings. Following is the pseudocode for finding the longest common substring using the dynamic programming approach:
function Longest_Common_Substring(S1[1..m], S2[1..n])
DP_Matrix = array(1..m, 1..n)
len = 0
result_substring = {}
for i = 1..m
for j = 1..n
if S1[i] == S2[j]
if i = 1 or j = 1
DP_Matrix[i, j] = 1
else
DP_Matrix[i, j] = DP_Matrix[i − 1, j − 1] + 1
if DP_Matrix[i, j] > len
len = DP_Matrix[i, j]
result_substring = {S1[i − len + 1..i]}
else if DP_Matrix[i, j] == len
result_substring = result_substring ∪ {S1[i − len + 1..i]}
else
DP_Matrix[i, j] = 0
return result_substring
The above solution takes quadratic (i.e., $\mathcal{O}(mn)$) time complexity to calculate the longest common substring. The solution is optimal and always guarantees the longest sub-match.
Note: The arrays in pseudocodes start from index 1. Moreover, this algorithm works fine only with non-empty strings.
The DP_Matrix
is a 2D array or matrix used as a dynamic programming table to store the substring length for the prefixes S1[1..i]
and S2[1..j]
with ending positions S1[i]
and S2[j]
.
The variable len
is used to store the current largest result_substring
length, which has been found so far. The substring is used to store the string of length len
.
All the cells in the first row and column of the DP_matrix
are initialized to 1
. An objective function is used to calculate the remaining cells.
The objective function for this formulation can be presented as:
- For a given cell
DP_Matrix[i,j]
, if the corresponding characters of given string sequences (i.e., characters atS1[i]
andS2[j]
) are matching, then we add1
to the value of cellDP_Matrix[i-1,j-1]
.- If the updated value of cell
DP_Matrix[i,j]
is greater than the previous length of substring, then updatelen
with this new value. - Simultaneously, save this new maximum string to an auxiliary string holder (i.e.,
result_substring
).
- If the updated value of cell
- If the corresponding characters are not matching (i.e., a mismatch case), then set the cell
DP_Matrix[i, j]
with a zero value, indicating that only a new sub-string can start from here.
After all the M*N
iterations (i.e., end of the for
loops), the result_substring
will have an optimally largest common sub-string.
Longest Common Substring Dynamic Programming Solution in C++
The following code is used to find the longest common substring from the strings using dynamic programming in C++.
#include <iostream>
#include <string>
using namespace std;
string Longest_Common_Substring(string Str1, string Str2) {
string result_substring = "";
int m = Str1.length();
int n = Str2.length();
int DP_Matrix[m + 1][n + 1];
int len = 0;
int end = 0;
for (int i = 0; i <= m; i++) {
for (int j = 0; j <= n; j++) {
if (Str1[i] == Str2[j]) {
if ((i == 0) || (j == 0))
DP_Matrix[i][j] = 1;
else
DP_Matrix[i][j] = 1 + DP_Matrix[i - 1][j - 1];
if (DP_Matrix[i][j] > len) {
len = DP_Matrix[i][j];
int start = i - DP_Matrix[i][j] + 1;
if (end == start) {
result_substring.push_back(Str1[i]);
} else {
end = start;
result_substring.clear();
result_substring.append(Str1.substr(end, (i + 1) - end));
}
}
} else {
DP_Matrix[i][j] = 0;
}
}
}
return result_substring;
}
int main() {
string String1, String2;
cout << "Enter the first string: ";
cin >> String1;
cout << "Enter the second string: ";
cin >> String2;
string substring = Longest_Common_Substring(String1, String2);
cout << "LCS of " << String1 << " and " << String2 << " is " << substring
<< " of Length " << substring.length() << endl;
return 0;
}
The above code contains the function Longest_Common_Substring
, which accepts two strings (i.e., String1
and String2
) as a parameter and returns their optimal largest common substring. The Longest_Common_Substring
function uses the dynamic programming table method to find this LCS.
In the above code, the DP_Matrix
is a 2D integer array that stores the current best substring length for the strings sub-pair S1[1..i]
and S2[1..j]
. The result_substring
, as discussed in the pseudocode explanation, is used to store the longest substring found.
Let’s look at the outputs of the above code for different string pairs:
Output 1:
Enter the first string: ATGCACATA
Enter the second string: GCAATGCACT
LCS of ATGCACATA and GCAATGCACT is ATGCAC of Length 6
Output 2:
Enter the first string: ABCDDFA
Enter the second string: DCDDACDDF
LCS of ABCDDFA and DCDDACDDF is CDDF of Length 4
Besides the theoretical or mathematical proof, the above outputs suggest that the discussed dynamic programming solution is optimal.
Husnain is a professional Software Engineer and a researcher who loves to learn, build, write, and teach. Having worked various jobs in the IT industry, he especially enjoys finding ways to express complex ideas in simple ways through his content. In his free time, Husnain unwinds by thinking about tech fiction to solve problems around him.
LinkedIn