Longest Common Subsequence Explanation

One of the most important implementations of Dynamic Programming is finding out the Longest Common Subsequence. Let’s define some of the basic terminologies first.

Subsequence:

A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. Let’s say we have a string ABC. If we erase zero or one or more than one character from this string we get the subsequence of this string. So the subsequences of string ABC will be {“A”, “B”, “C”, “AB”, “AC”, “BC”, “ABC”, ” “}. Even if we remove all the characters, the empty string will also be a subsequence. To find out the subsequence, for each characters in a string, we have two options - either we take the character, or we don’t. So if the length of the string is n, there are 2^n2n subsequences of that string.

Longest Common Subsequence:

As the name suggest, of all the common subsequencesbetween two strings, the longest common subsequence(LCS) is the one with the maximum length. For example: The common subsequences between “HELLOM” and “HMLD” are “H”, “HL”, “HM” etc. Here “HLL” is the longest common subsequence which has length 3.

Brute-Force Method:

We can generate all the subsequences of two strings using backtracking. Then we can compare them to find out the common subsequences. After we’ll need to find out the one with the maximum length. We have already seen that, there are 2n subsequences of a string of length n. It would take years to solve the problem if our n crosses 20-25.

Dynamic Programming Method:

Let’s approach our method with an example. Assume that, we have two strings abcdaf and acbcf. Let’s denote these with s1 and s2. So the longest common subsequence of these two strings will be “abcf”, which has length 4. Again I remind you, subsequences need not be continuous in the string. To construct “abcf”, we ignored “da” in s1 and “c” in s2. How do we find this out using Dynamic Programming?

We’ll start with a table (a 2D array) having all the characters of s1 in a row and all the characters of s2 in column. Here the table is 0-indexed and we put the characters from 1 to onwards. We’ll traverse the table from left to right for each row. Our table will look like:

0     1     2     3     4     5     6
+-----+-----+-----+-----+-----+-----+-----+-----+
| chʳ |     |  a  |  b  |  c  |  d  |  a  |  f  |
+-----+-----+-----+-----+-----+-----+-----+-----+
0  |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
1  |  a  |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
2  |  c  |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
3  |  b  |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
4  |  c  |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
5  |  f  |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+

Here each row and column represent the length of the longest common subsequence between two strings if we take the characters of that row and column and add to the prefix before it. For example: Table[2][3] represents the length of the longest common subsequence between “ac” and “abc”.

The 0-th column represents the empty subsequence of s1. Similarly the 0-th row represents the empty subsequence of s2. If we take an empty subsequence of a string and try to match it with another string, no matter how long the length of the second substring is, the common subsequence will have 0 length. So we can fill-up the 0-th rows and 0-th columns with 0’s. We get:

0     1     2     3     4     5     6
+-----+-----+-----+-----+-----+-----+-----+-----+
| chʳ |     |  a  |  b  |  c  |  d  |  a  |  f  |
+-----+-----+-----+-----+-----+-----+-----+-----+
0  |     |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
+-----+-----+-----+-----+-----+-----+-----+-----+
1  |  a  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
2  |  c  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
3  |  b  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
4  |  c  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
5  |  f  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+

Let’s begin. When we’re filling Table[1][1], we’re asking ourselves, if we had a string a and another string a and nothing else, what will be the longest common subsequence here? The length of the LCS here will be 1. Now let’s look at Table[1][2]. We have string ab and string a. The length of the LCS will be 1. As you can see, the rest of the values will be also 1 for the first row as it considers only string a with abcd, abcda, abcdaf. So our table will look like:

0     1     2     3     4     5     6
+-----+-----+-----+-----+-----+-----+-----+-----+
| chʳ |     |  a  |  b  |  c  |  d  |  a  |  f  |
+-----+-----+-----+-----+-----+-----+-----+-----+
0  |     |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
+-----+-----+-----+-----+-----+-----+-----+-----+
1  |  a  |  0  |  1  |  1  |  1  |  1  |  1  |  1  |
+-----+-----+-----+-----+-----+-----+-----+-----+
2  |  c  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
3  |  b  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
4  |  c  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
5  |  f  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+

For row 2, which will now include c. For Table[2][1] we have ac on one side and a on the other side. So the length of the LCS is 1. Where did we get this 1 from? From the top, which denotes the LCS a between two substrings. So what we are saying is, if s1[2] and s2[1] are not same, then the length of the LCS will be the maximum of the length of LCS at the top, or at the left. Taking the length of the LCS at the top denotes that, we don’t take the current character from s2. Similarly, Taking the length of the LCS at the left denotes that, we don’t take the current character from s1 to create the LCS. We get:

0     1     2     3     4     5     6
+-----+-----+-----+-----+-----+-----+-----+-----+
| chʳ |     |  a  |  b  |  c  |  d  |  a  |  f  |
+-----+-----+-----+-----+-----+-----+-----+-----+
0  |     |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
+-----+-----+-----+-----+-----+-----+-----+-----+
1  |  a  |  0  |  1  |  1  |  1  |  1  |  1  |  1  |
+-----+-----+-----+-----+-----+-----+-----+-----+
2  |  c  |  0  |  1  |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
3  |  b  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
4  |  c  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
5  |  f  |  0  |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+

So our first formula will be:

if s2[i] is not equal to s1[j]
    Table[i][j] = max(Table[i-1][j], Table[i][j-1]
endif