APPROXIMATE STRING MATCHING

It is typical to identify a set of common problems that might have caused mismatches between the substring and the text with which it is being matched.

Those differences are that the corresponding characters in the substring and text are different, the substring has a character the text doesn’t have, or the text has a character the substring doesn’t have. Typically, typing errors fall into one of these three types, with a common error of transposed characters being treated as two character differences of the first type.

We will typically look for a k-approximate match for the substring, where k represents the maximum number of differences of the kind mentioned in the previous paragraph. There are a number of possibilities that we will need to keep track of. For example, what does it mean if the first character of the substring and text do not match? It could mean that there is a mismatch of characters, there is a character missing from the substring, or a character is missing from the text. If the characters do match, getting a better overall match of the entire substring may still require that we consider the case of a character missing from the pattern or the text.

For example, consider the attempt to match the substring “ad” with the text “read.” The first position has two possible 2-approximate matches. (The a is changed to an r and the d is changed to an e, or there could be an “re” added to the front of the string.) There is also a 3-approximate match at the first position. (Add an r and change the “ad” to an “ea”.) The second position has a 2-approximate match (change the “ad” to an “ea”) and a 1-approximate match (add an e to the front).

Notice that there can be a lot of possibilities and they build very quickly. If the first few characters matched, but then we hit a sequence that didn’t, we might find a better match if we changed some characters or put some extra characters into the substring, into the text, or into both. How can we consider the possibilities and still do this with a reasonable algorithm and data structure?

If the algorithm has to test all possibilities, it will be too complex. Therefore, we make the algorithm simple but have a larger data structure.

We will solve this problem by creating a matrix that we will call diffs to hold the information that we have gathered so far. Each row of this matrix will be associated with one of the characters in the substring, and each column will be associated with one of the characters in the text. The values in the matrix will give us an idea of how well the matching process is going at that point. So, if the value in row 5 column 27 is a 4, in matching the first five characters of the substring with the portion of the text ending at location 27, we have found four differences.

The number of differences for any location will be based on the three possible values that are immediately above, to the left, and to the left and diagonally up. If we use the value above, we are implying that the text is missing a character of the substring. If we use the value to the left, we are implying that the substring is missing a character of the text. Use of the diagonal value is related to the match or mismatch of the characters. More specifically, for any value of diffs[i, j], we will look at the minimum of three values.³

1. diffs[i - 1, j - 1] if substring_i = text_j, otherwise diffs[i - 1, j - 1] + 1

2. diffs[i - 1, j] + 1 (substring_i is missing from the text)

3. diffs[i, j - 1] + 1 (text_j is missing from the substring)

To get this process started, if we refer to any location above the matrix (in other words, i = 0), that location will be considered to have a zero stored in it. If we refer to any location to the left of the matrix (in other words, j = 0), that location will be considered to have the corresponding value of i stored in it. A sample of this for the substring “trim” and the text “try the trumpet” is given in Fig. 5.9.

³ Notice that diffs[i - 1, j - 1] represents the value diagonally up and to the left, diffs[i - 1, j] represents the value above, and diffs[i, j - 1] represents the value to the left.

If we look at the bottom row for the character y in the text, we see the value 2, which represents the fact that to match “trim” so that it ends at the y would require two differences of the kind discussed before. Those two differences would represent either an m missing from the text after the y and the mismatch of the y with the i, or an i missing from the text before the y and the mismatch of the y with the m. So, the bottom row gives us the best possible matches of the substring ending at that point in the text. We see that the closest match of “trim” in the text, with one difference, would end at the m in trumpet and represents the mismatch of the i and u.
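As a minimal sketch of the process just described, the following Python function fills in the whole matrix from the three-way minimum and the two boundary rules. The name build_diffs and the list-of-rows layout are assumptions made for illustration; they are not part of the original presentation.

```python
def build_diffs(substring, text):
    """Return the diffs matrix as a list of rows (hypothetical helper).

    Row i goes with the first i characters of substring and column j
    with the portion of text ending at position j, so row 0 and
    column 0 hold the boundary values.
    """
    s, t = len(substring), len(text)
    # Locations above the matrix (i = 0) hold zero.
    diffs = [[0] * (t + 1) for _ in range(s + 1)]
    # Locations to the left of the matrix (j = 0) hold i.
    for i in range(1, s + 1):
        diffs[i][0] = i
    for j in range(1, t + 1):
        for i in range(1, s + 1):
            mismatch = 0 if substring[i - 1] == text[j - 1] else 1
            diffs[i][j] = min(
                diffs[i - 1][j - 1] + mismatch,  # diagonal: match or mismatch
                diffs[i - 1][j] + 1,             # above: substring char missing from text
                diffs[i][j - 1] + 1,             # left: text char missing from substring
            )
    return diffs


diffs = build_diffs("trim", "try the trumpet")
print(diffs[4])  # bottom row
# [4, 3, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 1, 2, 3, 3]
```

The printed bottom row shows the 2 under the y (position 3) and the 1 under the m of trumpet (position 12).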

If this process were used in practice, we would specify not only the substring and text but also the maximum number of differences for which we were looking. The algorithm would fill in the matrix column by column until the bottom value of a column was less than or equal to the number given. This means that the algorithm does not need to store S*T integers for this matrix (where S is the number of characters in the substring and T is the number of characters in the text), but rather it just needs to store 2S integers for the column being calculated and for the previous column on which it depends.
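The column-by-column idea might be sketched as follows, keeping only the previous and current columns. The name find_k_approximate, the 1-based return value, and the choice to return None when no column qualifies are assumptions for illustration. Each column here has S + 1 entries because the zero above the matrix is stored explicitly.

```python
def find_k_approximate(substring, text, k):
    """Return the 1-based text position where the first match with at
    most k differences ends, or None if no such match is found."""
    s = len(substring)
    previous = list(range(s + 1))      # column 0: entry i holds i
    for j, text_char in enumerate(text, start=1):
        current = [0] * (s + 1)        # entry 0 (above the matrix) is zero
        for i in range(1, s + 1):
            mismatch = 0 if substring[i - 1] == text_char else 1
            current[i] = min(
                previous[i - 1] + mismatch,  # diagonal
                current[i - 1] + 1,          # above
                previous[i] + 1,             # left
            )
        if current[s] <= k:            # bottom value is small enough: stop
            return j
        previous = current
    return None


print(find_k_approximate("trim", "try the trumpet", 1))  # 12
```

With k = 1 the scan stops at position 12, the m in trumpet, which is the one-difference match identified in the sample matrix.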

This style of algorithm is classified as “dynamic programming,” which will be discussed again in Chapter 9.

5.2.1 Analysis

This process is easy to analyze because of the nature of the matrix. We see that for each location in the matrix, we do one character comparison. This means that in the worst case there will be S*T comparisons. Notice that even with all of the possible differences that could occur, this process operates as efficiently as the straightforward exact string match algorithm.
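The S*T count can be checked with a small instrumented sketch that fills in every column without stopping early and tallies one comparison per matrix location. The function name count_comparisons is hypothetical.

```python
def count_comparisons(substring, text):
    """Fill in every column and count the character comparisons made."""
    s = len(substring)
    comparisons = 0
    previous = list(range(s + 1))
    for text_char in text:
        current = [0] * (s + 1)
        for i in range(1, s + 1):
            comparisons += 1           # one comparison per matrix location
            mismatch = 0 if substring[i - 1] == text_char else 1
            current[i] = min(previous[i - 1] + mismatch,
                             current[i - 1] + 1,
                             previous[i] + 1)
        previous = current
    return comparisons


print(count_comparisons("trim", "try the trumpet"))  # 60, which is 4 * 15
```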

         t  r  y  _  t  h  e  _  t  r  u  m  p  e  t
      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   t  1  0  1  1  1  0  1  1  1  0  1  1  1  1  1  0
   r  2  1  0  1  2  1  1  2  2  1  0  1  2  2  2  1
   i  3  2  1  1  2  2  2  2  3  2  1  1  2  3  3  2
   m  4  3  2  2  2  3  3  3  3  3  2  2  1  2  3  3

FIGURE 5.9 The diffs matrix for the substring “trim” and the text “try the trumpet” (_ marks a space in the text)

5.2.2 Exercises

1. Construct the approximate matches matrix for substring “their” and text “hello there friends.”

2. Construct the approximate matches matrix for substring “where” and text “were they here.”

3. When we looked for an exact match, we could determine the starting point easily, because we knew the start had to be S characters from the end. With approximate matching, finding the start of the match is not so easy because of the possibility of characters missing from the substring, the text, or both. Give a detailed description of what data structure and process you would add to the algorithm described in this lesson to have that information available when a k-approximate match is found. (Hint: One way to see if the parentheses in an expression match is to keep a counter as you scan the expression, adding 1 to it on every open parenthesis, subtracting 1 from it on every close parenthesis, and not changing it for other characters. Can you do something similar to keep track of missing characters?)