# Bitap算法

Bitap算法（或称为shift-orshift-andBaeza-Yates–Gonnet算法）是一种字符串近似匹配算法。该算法可判断给定的文本是否包含与定义模式“近似相等”的子字符串。其中，根据萊文斯坦距離 – 如果子字符串和定义模式彼此在给定距离“K”之内，则该算法认为他们近似。该算法预先计算一组位掩码，其中每个位掩码的每一个元素都包含一个模式。然后，它可以通过按位操作以完成大部分工作。

## 精确搜寻

```algorithm bitap_search is
input: text as a string.
pattern as a string.
output: string
m := length(pattern)

if m = 0 then
return text

/* 初始化位数组R */
R := new array[m+1] of bit, initially all 0
R[0] := 1

for i := 0; i < length(text); i += 1 do
/* Update the bit array. */
for k := m; k ≥ 1; k -= 1 do
R[k] := R[k - 1] & (text[i] = pattern[k - 1])

if R[m] then
return (text + i - m) + 1

return null
```

``` #include <string.h>
#include <limits.h>

const char *bitap_bitwise_search(const char *text, const char *pattern)
{
int m = strlen(pattern);
unsigned long R;
int i;

if (pattern[0] == '\0') return text;
if (m > 31) return "The pattern is too long!";

/* 初始化位数组R */
R = ~1;

/* 初始化位掩码 */
for (i=0; i <= CHAR_MAX; ++i)
for (i=0; i < m; ++i)

for (i=0; text[i] != '\0'; ++i) {
/* 更新位数组 */
R <<= 1;

if (0 == (R & (1UL << m)))
return (text + i - m) + 1;
}

return NULL;
}
```

## 模糊搜寻

``` #include <stdlib.h>
#include <string.h>
#include <limits.h>

const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)
{
const char *result = NULL;
int m = strlen(pattern);
unsigned long *R;
int i, d;

if (pattern[0] == '\0') return text;
if (m > 31) return "The pattern is too long!";

/* 初始化位数组R */
R = malloc((k+1) * sizeof *R);
for (i=0; i <= k; ++i)
R[i] = ~1;

/* 初始化位掩码 */
for (i=0; i <= CHAR_MAX; ++i)
for (i=0; i < m; ++i)

for (i=0; text[i] != '\0'; ++i) {
/* 更新位数组 */
unsigned long old_Rd1 = R[0];

R[0] <<= 1;

for (d=1; d <= k; ++d) {
unsigned long tmp = R[d];
R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
old_Rd1 = tmp;
}

if (0 == (R[k] & (1UL << m))) {
result = (text+i - m) + 1;
break;
}
}

free(R);
return result;
}
```

## 参考文献

1. ^ Bálint Dömölki, An algorithm for syntactical analysis, Computational Linguistics 3, Hungarian Academy of Science pp. 29–46, 1964.
2. ^ Bálint Dömölki, A universal compiler system based on production rules, , 8(4), pp 262–275, 1968. doi:10.1007/BF01933436
3. ^ R. K. Shyamasundar, Precedence parsing using Dömölki's algorithm, , 6(2)pp 105–114, 1977.
4. ^ Ricardo Baeza-Yates. "Efficient Text Searching." PhD Thesis, University of Waterloo, Canada, May 1989.
5. ^ Udi Manber, Sun Wu. "Fast text searching with errors." Technical Report TR-91-11. Department of Computer Science, University of Arizona, Tucson, June 1991. (gzipped PostScript页面存档备份，存于互联网档案馆）)
6. ^ Ricardo Baeza-Yates, Gastón H. Gonnet. "A New Approach to Text Searching." Communications of the ACM, 35(10): pp. 74–82, October 1992.
7. ^ Udi Manber, Sun Wu. "Fast text search allowing errors." Communications of the ACM, 35(10): pp. 83–91, October 1992, doi:10.1145/135239.135244.
8. ^ R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1–23, Irvine, CA, June 1996.
9. ^ G. Myers. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM 46 (3), May 1999, 395–415.
10. Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval. 1999. ISBN 0-201-39829-X.

## 外部链接

• libbitap，一个自由的实例，它展示了如何轻松地将算法扩展为大多数正则表达式。与上面的代码不同，它对模式长度没有限制。
• bitap.py页面存档备份，存于互联网档案馆） - Python实现：Wu-Manber修改的Bitap算法。