Wednesday, 16 March 2011

Boyer Moore Fast string searching algorithm

“ We have seen a strstr implementation in previous articles. It takes O(n) time and not the best string search algorithm when it comes to large text. Lets say, you need search a 1 GB file. We need to optimize string search in some way”

See also

http://www.movsd.com/bm.htm

http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://analgorithmaday.blogspot.com/2011/03/string-of-string-search.html

Introduction

     I didn’t find much examples and easy to follow code on the internet except some of the links given above. Read through it before getting some idea behind this optimized algorithm.

  The main optimization lies in Character jump heuristics. We skip some unnecessary comparisons by starting the comparison between the source and the pattern, from the end of the pattern instead of the start of the pattern.

  In our str_str implementation, we start with the initial character in the pattern. But here, we do it from the end. But, If you move from backwards, what you get. You can skip further comparisons to the start of the pattern. But how we decide we can skip further comparison?

Concept

   The skip we do if there is a mismatch in last letter comparison follows little bit of size heuristics. All the text search/match algorithm follows this strict and very simple to follow heuristics.

  Example,

Source:   I think am in love with you

Pattern:   love

Since we compare from the end of the pattern, “e” and “h” are compared. Both are not matching, since we matched the pattern of size of “4” with source of size “4” and found that the last “e” and “h” are not matching, we can jump!!.. This is also known as the good suffix and uses the simple size heuristics.

What is character jump heuristics then?. Some time, there might be some character matching and some would not.

Example,

Source:   aabaaabbbbbbaaaaabbabaaaaaaaaaa

Pattern:   bab

If we start comparing the above two strings, you get “ab” at many places. But at the third comparison, there are failures. This is a bad suffix problem.

In this case, we need to take care of the position of mismatch, in first run, its “b” and “a”, in the pattern and shift. For this we create a table. This is called character heuristic table.

This table is an ASCII table of size 128 for each ASCII character. ‘a' = 97, ‘z’ =122, ‘A’=65, ‘Z’=90. So, for this pattern, the count will come like this,

  b, a, b = 2, 1, 0

We create a integer array of size 128 and use character itself as an index to it as its C++. other languages can use hash tables etc :)

So, table[‘b’] =2 and table[‘a’]=1. For the above case, so, if a mismatch comes with ‘a’ in source, jump 1. if mismatch comes with ‘b’ in  source, jump 2 in source. In the above case, the jumps are less but when b compares with a, we shift more. :)

The above case is little bit like worst case for our algorithm :)

This algorithm gives linear time O(n) in worst case but most of the cases, works with less than O(n) time.

Code

 
int boyer_moore(char* sstr, char* pattern)
{
    char* inits = sstr;
    char* initp = pattern;
 
    int spatt = strlen(pattern);
    while(*pattern != '\0') pattern++;
    // this algorithm tested for printable ASCII characters
    // from ASCII, 65-90 and 97-122
    int* jump_table=(int*) calloc(128, sizeof(int));
    int count=0;
 
    while(pattern != initp) {
        pattern--;
        jump_table[*pattern]=count;
        count++;
    }
 
    char* stmp=0;
    char* iter=0;
    int shift=0;
    int bad_count=0;
 
    while(*sstr != '\0')
    {
        bad_count=spatt;
        stmp = sstr+ (spatt-1);
        iter = pattern + (spatt-1);
        while(*stmp == *iter) {
            bad_count--;
            stmp--;
            iter--;
            if(bad_count==0)
                return sstr-inits;
        }
 
        //jump table
        (jump_table[*stmp] == 0)?shift=bad_count:shift=jump_table[*stmp];
        sstr += shift;
    }
    //not found
    return -1;
}
 
 
void main()
{
    char* source = "aabaaabbbbbbaaaaabbabaaaaaaaaaa";
    char* pattern = "bab";
 
    std::cout<<boyer_moore(source, pattern);
}

Important points

  • This is plain copy of the concepts extracted from http://www.movsd.com/bm.htm
  • There is some terminology called good suffix, bad suffix.
    • good suffix= if we do a full shift of size equal to pattern
    • bad suffix = if we do partial shift
  • These terminologies are really simple and we are not using separate table for bad suffix.
  • bad suffix is calculated using bad_count. good_suffix is taken from jump_table[mismatch]
  • only if mismatch character is not in the pattern, we go for a bad shift which incurs more comparisons.
  • Not found is the case when sstr reaches null.
  • Problems while writing code is mainly the confusion of creating tables. Reverse comparison and main loop is almost similar to str_str.
  • The above code has a bug with repeated strings and patterns
  • Try the above code for below input, it FAILS
    • Source: aabaaabbbbbbaaaaabbabaaaaaaaaaa
    • Pattern: baaaa

The below code is the bug fix for adding additional heuristics for repeated characters in patterns. This is based on the concept given under “The additional heuristic” in http://www.movsd.com/bm.htm

We maintain the count of number of characters found and subtract that with the shift found from the pattern jump table. It will never go less than 1 or 0 in normal case. But in our special bug case, it will go to 0. :) So, we shift just 1 instead of shifting beyond with the jump_table since the calculation in the jump table replaces the pattern calculation.

For above, the pattern jump table will be like, table[a] = 4 and table[b]=5. the index of ‘a’s in other places are not considered.

It’s better to have the additional heuristic as well to make it work for all cases.

int boyer_moore(char* sstr, char* pattern)
{
    char* inits = sstr;
    char* initp = pattern;
 
    int spatt = strlen(pattern);
    while(*pattern != '\0') pattern++;
    // this algorithm tested for printable ASCII characters
    // from ASCII, 65-90 and 97-122
    int* jump_table=(int*) calloc(128, sizeof(int));
    int count=0;
 
    while(pattern != initp) {
        pattern--;
        jump_table[*pattern]=count;
        count++;
    }
 
    char* stmp=0;
    char* iter=0;
    int shift=0;
    int bad_count=0;
    int bcount=0;
    while(*sstr != '\0')
    {
        bcount=0;
        bad_count=spatt;
        stmp = sstr+ (spatt-1);
        iter = pattern + (spatt-1);
        while(*stmp == *iter) {
            bad_count--;
            bcount++;
            stmp--;
            iter--;
            if(bcount==spatt)
                return sstr-inits;
        }
 
        //jump table
        if(jump_table[*stmp] == 0) {
            // the character not found in pattern
            shift=bad_count;
        } else {
            shift=jump_table[*stmp];
            (shift - bcount < 1)?shift = 1: shift = shift-bcount;
        }
        sstr += shift;
    }
    //not found
    return -1;
}
 
 
void main()
{
    char* source = "aabaaabbbbbbaaaaabbabaaaaaaaaaa";
    char* pattern = "baaaa";
 
    std::cout<<boyer_moore(source, pattern);
}