\myheading{Boyer-Moore string search algorithm explanation and formal verification using CBMC} \label{BoyerMoore} \leveldown{} \renewcommand{\CURPATH}{verif/boyer_moore} Previously I did this with Knuth-Morris-Pratt algorithm: \ref{KMP}. So again, let's try to reinvent string-search algorithm. \myheading{Version 1} If a substring would be compared in reverse order, things may be different. For example, we search for a 'CAT' substring in 'HORSE TURTLE CAT' string. With naive algorithm, we will first compare 'H' and 'C', we see they are unequal characters and we will advance substring one character ahead: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} Right, but what if we will start at the end of 'CAT' substring proceeding back? We first compare 'R' with 'T' (last character of 'CAT') and we see that they are not equal. We could advance substring one character ahead, but... Can't we see that 'R' character in the middle of 'HORSE' part is absent in the 'CAT' substring? Can't this tell us something useful? Yes -- since 'R' is not in 'CAT', we can advance 'CAT' substring more, because we know that this will fail for sure: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} This too: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} So if we see a character not in 'CAT', we can advance 3 characters ahead: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} But this will fail too, also, the space (' ') character is absent in 'CAT' too. OK, we advance 3 characters ahead: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} No, 'R' is not in 'CAT', advance again 3 characters ahead: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} Failing again... Adding 3 characters: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} This is important. 'A' != 'T', but 'A' is in the 'CAT' substring, so this time we advance only 1 character ahead, as in naive search algorithm: \begin{lstlisting} HORSE TURTLE CAT CAT ^ \end{lstlisting} Here we will compare 3 characters and return success. All this is way faster than naive string search algorithm, and is a strong competitor with Knuth-Morris-Pratt algorithm. This is the source code of a function that is hardcoded to search for 'cat' substring, nothing else: \lstinputlisting[style=customc]{\CURPATH/search_cat_BM_v1.c} As compared with a naive brute-force string algorithm, it shows a significant advantage: \begin{lstlisting} ** cacat search_cat_brute read_ops=7 search_cat_BM_v1 read_ops=5 ** cac cat search_cat_brute read_ops=10 search_cat_BM_v1 read_ops=5 ** cac cac cat search_cat_brute read_ops=17 search_cat_BM_v1 read_ops=7 ** horse turtle cat search_cat_brute read_ops=16 search_cat_BM_v1 read_ops=8 ** horse cat turtle search_cat_brute read_ops=9 search_cat_BM_v1 read_ops=5 \end{lstlisting} \myheading{Version 2} Now the next observation. If we search for 'CAT' and 'CACAT', we see that 'C' != 'T', but 'C' is in the 'CAT' substring, so we would advance by one character. \begin{lstlisting} CACAT CAT ^ \end{lstlisting} But... we can also align substring so that the first 'C' character in 'CAT' would match the next 'C' in 'CACAT', by advancing for 2 characters ahead: \begin{lstlisting} CACAT CAT ^ \end{lstlisting} We add a rule: "if a mismatched character in string exist in substring, align substring so that they would match" or, in other words, "if 'C' in string isn't equal to the last character in 'CAT' substring, advance by 2 characters": \begin{lstlisting}[style=customc] if (char3=='t') { ... } else { // 3rd character isn't 't' if ((char3!='c') && (char3!='a') && (char3!='t')) skip=3; if (char3=='c') // added skip=2; // added } i=i+skip; \end{lstlisting} It may perform even better on some strings: \begin{lstlisting} ** cacat ** cacat search_cat_brute read_ops=7 search_cat_brute read_ops=7 search_cat_BM_v1 read_ops=5 search_cat_BM_v2 read_ops=4 \end{lstlisting} As of aligning 'A' character in the middle of 'CAT' substring -- we wouldn't bother, because advancing substring by one character it's the same as is done by default. Again, our algorithm is hardcoded only for 'CAT'. And this is the universal algorithm from the R.Sedgewick's book, the first function construct a table of 'skips', the second does the actual search (Java code): \begin{lstlisting} public BoyerMoore(String pat) { this.R = 256; this.pat = pat; // position of rightmost occurrence of c in the pattern right = new int[R]; for (int c = 0; c < R; c++) right[c] = -1; for (int j = 0; j < pat.length(); j++) right[pat.charAt(j)] = j; } ... public int search(String txt) { int m = pat.length(); int n = txt.length(); int skip; for (int i = 0; i <= n - m; i += skip) { skip = 0; for (int j = m-1; j >= 0; j--) { if (pat.charAt(j) != txt.charAt(i+j)) { skip = Math.max(1, j - right[txt.charAt(i+j)]); break; } } if (skip == 0) return i; // found } return n; // not found \end{lstlisting} ( \href{https://algs4.cs.princeton.edu/code/edu/princeton/cs/algs4/BoyerMoore.java.html}{src} ) And what if we search for 'COCOS' substring? What if characters are repeating within search pattern? Well, we will count only rightmost characters, ignoring two first 'CO'. \myheading{Formal verification using CBMC} I did that already with my homebrew Knuth-Morris-Pratt algorithm: \ref{KMP}. \begin{lstlisting}[caption=Increase object-bits and unwind options if not enough] cbmc --object-bits 9 --trace --function check -DLEN=10 -DCBMC --unwind 9 --unwinding-assertions BM_cat_v1.c cbmc --object-bits 9 --trace --function check -DLEN=10 -DCBMC --unwind 9 --unwinding-assertions BM_cat_v2.c \end{lstlisting} \lstinputlisting{\CURPATH/cbmc.txt} Now important note. If you modify these lines so that this conditional expression would compare with other characters... \begin{lstlisting} ... { // 1st character isn't 'c' if ((char1!='c') && (char1!='a') && (char1!='t')) skip=3; } ... \end{lstlisting} CBMC will still verify this algorithm and prove its correctness, because a bug here wouldn't make the algorithm incorrect. It will degrade its performance to a level of naive string search algorithm, as if 'skip=3' here would be replaced with 'skip=1'. It will still work correctly, but slower. \myheading{But this algorithm is not finished.} I reimplemented it from R.Sedgewick book. And he states: \href{https://algs4.cs.princeton.edu/53substring/}{Program BoyerMoore.java implements the bad-character rule part of the Boyer-Moore algorithm. It does not implement the strong good suffix rule.} It coincides with 'delta1' rule from \href{\RepoURL/\CURPATH/files/fstrpos.pdf}{the original Boyer-Moore paper}. But there is also a 'delta2' rule, which is just another story for another time. \myheading{Files} \url{\RepoURL/\CURPATH/files} \myheading{Discussion at Hacker News} \url{https://news.ycombinator.com/item?id=26900640} \levelup{}