String Matching: Naive Algorithm
Some programs that you use everyday need to find occurrences of a string in a text.
Think about your preferred IDE or simply any text-editing program.
When you misspelling a variable name and you want to fix this, you need to find all the part of the text where the name is present and substitute it.
String Matching algorithms are widely used in the field of Bioinformatics for detect patterns in DNA or RNA sequences since they are represented digitally as strings over a well-defined alphabet.
The string matching (or pattern matching) problem is define as follows.
Given a text T and a pattern P, determine all the occurrences of the pattern in the text.
We assume that the text T is an array of length n and the pattern P is an array of m elements.
We say that pattern P occurs with shifts s in text T if
- 0 ≤ s ≤ n-m, and
- T[s+1 : s+m] = P[1 : m]
A shift is valid only if pattern P occurs with shift s in text T.
We can say that the string matching problem is the problem of finding all valid shifts with which the pattern P occurs in a text T.
The Naive Algorithm
Let’s see now a very simple (and inefficient) algorithm for string matching, the so called naive string-matching algorithm.
We use a for loop from 0 to n-m (respectively, the length of the text T and the length of the pattern P) to checks if P is equal to the substring of S taken from s+1 and s+m.
Pseudocode
NaiveStringMatcher(T,P):
n = length(T)
m = length(P)
FOR s = 0 to n-m do:
IF P[1 : m] == T[s+1 : s+m]:
print "Pattern occurs with shift " s
The pattern slides over the text and if the pattern is found in the text starting from s (the if statement is true), we print out the shift s.
This is an inefficient algorithm for long text. Why?
Because for every possible n-m+1 shift we have to compare all the character of P with the m character of T[s+1 : s+m].
The worst case running time is Θ((n-m+1)*m).
If m is equal to n/2, the running time is Θ(n²).
Golang implementation
Now we see how this naive algorithm can be implemented in golang.
func NaiveStringMatching(text string, pattern string) { n := len(text)
m := len(pattern) for s := 0; s < (n-m)+1; s++ {
if pattern == text[s:s+m] {
fmt.Println("Pattern occurs with shift", s)
}
}
}
EXAMPLE OF EXECUTION
With Text
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
ché la diritta via era smarrita.
Ahi quanto a dir qual era è cosa dura
esta selva selvaggia e aspra e forte
che nel pensier rinova la paura!- Incipit of Dante Alighieri’s Divina Commedia
And Pattern “selva”, the output of the above algorithm is:
Pattern occurs with shift 56
Pattern occurs with shift 148
Pattern occurs with shift 154
This means that we can found the string “selva” in the text T three time:
- From position 56 to 56+m (where m in this case is 5), T[56 : 56 + 5]
- From position 148 to 148+5, T[148 : 148 + 5]
- From 154 to 154 + 5, T[154 : 154 + 5]
func main() {
text := "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura, ché la diritta via era smarrita. Ahi quanto a dir qual era è cosa dura esta selva selvaggia e aspra e forte che nel pensier rinova la paura!"
pattern := "selva" NaiveStringMatching(text, pattern) fmt.Println("T[56:56+5] -> ", text[56:56+5])
fmt.Println("T[146:148+5] -> ", text[148:148+5])
fmt.Println("T[154:154+5] -> ", text[154:154+5])
}// OUTPUT: go run main.go
Pattern occurs with shift 56
Pattern occurs with shift 148
Pattern occurs with shift 154
T[56:56+5] -> selva
T[146:148+5] -> selva
T[154:154+5] -> selva
And now?
There are other types of string matching algorithm that works better.
The main defect of the Naive Algorithm is that it ignores the information gained on a value of s when it consider other values of s.
In the next posts, we will see other algorithms, like the Knuth-Morris-Pratt algorithm and another one that uses finite automata.