String Matching: Naive Algorithm

Giacomo Zanatta
4 min readJan 5, 2021

Some programs that you use everyday need to find occurrences of a string in a text.
Think about your preferred IDE or simply any text-editing program.
When you misspelling a variable name and you want to fix this, you need to find all the part of the text where the name is present and substitute it.

When you type ctrl + F in vscode you are going to do a string matching

String Matching algorithms are widely used in the field of Bioinformatics for detect patterns in DNA or RNA sequences since they are represented digitally as strings over a well-defined alphabet.

The string matching (or pattern matching) problem is define as follows.

Given a text T and a pattern P, determine all the occurrences of the pattern in the text.

We assume that the text T is an array of length n and the pattern P is an array of m elements.
We say that pattern P occurs with shifts s in text T if

  • 0 ≤ s ≤ n-m, and
  • T[s+1 : s+m] = P[1 : m]

A shift is valid only if pattern P occurs with shift s in text T.
We can say that the string matching problem is the problem of finding all valid shifts with which the pattern P occurs in a text T.

The Naive Algorithm

Let’s see now a very simple (and inefficient) algorithm for string matching, the so called naive string-matching algorithm.

We use a for loop from 0 to n-m (respectively, the length of the text T and the length of the pattern P) to checks if P is equal to the substring of S taken from s+1 and s+m.

Pseudocode

NaiveStringMatcher(T,P):
n = length(T)
m = length(P)
FOR s = 0 to n-m do:
IF P[1 : m] == T[s+1 : s+m]:
print "Pattern occurs with shift " s

The pattern slides over the text and if the pattern is found in the text starting from s (the if statement is true), we print out the shift s.

This is an inefficient algorithm for long text. Why?
Because for every possible n-m+1 shift we have to compare all the character of P with the m character of T[s+1 : s+m].
The worst case running time is Θ((n-m+1)*m).
If m is equal to n/2, the running time is Θ(n²).

Golang implementation

Now we see how this naive algorithm can be implemented in golang.

func NaiveStringMatching(text string, pattern string) {   n := len(text)
m := len(pattern)
for s := 0; s < (n-m)+1; s++ {
if pattern == text[s:s+m] {
fmt.Println("Pattern occurs with shift", s)
}
}
}

EXAMPLE OF EXECUTION

With Text

Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
ché la diritta via era smarrita.
Ahi quanto a dir qual era è cosa dura
esta selva selvaggia e aspra e forte
che nel pensier rinova la paura!

- Incipit of Dante Alighieri’s Divina Commedia

And Pattern “selva”, the output of the above algorithm is:

Pattern occurs with shift 56
Pattern occurs with shift 148
Pattern occurs with shift 154

This means that we can found the string “selva” in the text T three time:

  • From position 56 to 56+m (where m in this case is 5), T[56 : 56 + 5]
  • From position 148 to 148+5, T[148 : 148 + 5]
  • From 154 to 154 + 5, T[154 : 154 + 5]
func main() {
text := "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura, ché la diritta via era smarrita. Ahi quanto a dir qual era è cosa dura esta selva selvaggia e aspra e forte che nel pensier rinova la paura!"
pattern := "selva"
NaiveStringMatching(text, pattern) fmt.Println("T[56:56+5] -> ", text[56:56+5])
fmt.Println("T[146:148+5] -> ", text[148:148+5])
fmt.Println("T[154:154+5] -> ", text[154:154+5])
}
// OUTPUT: go run main.go

Pattern occurs with shift 56
Pattern occurs with shift 148
Pattern occurs with shift 154
T[56:56+5] -> selva
T[146:148+5] -> selva
T[154:154+5] -> selva

And now?

There are other types of string matching algorithm that works better.
The main defect of the Naive Algorithm is that it ignores the information gained on a value of s when it consider other values of s.
In the next posts, we will see other algorithms, like the Knuth-Morris-Pratt algorithm and another one that uses finite automata.

--

--