Wednesday, February 13, 2008

[Python]Simple Impl. of Bloom Filter

Bloom filter is a data structure to tell whether an object is in given list or not. Of cource, if you hold the given list you can do it. The merit of bloom filter is it doesn't need to hold the original list. As a result you can save a memory space. However, it lost an accuracy of result as the price of space-efficiency. It is possible to answer 'yes' while a query is not in the list in a probability. Because you can estimate the probability, bloom filter remains useful.

The following source is a simple implementation of a bloom filter to learn algorithm.


SIZE = 1987
def hashes(s):
xs = [0, 0, 0]
for c in s:
o = ord(c)
xs[0] = xs[0] * 137 + o
xs[1] = xs[1] * 69 + o
xs[2] = xs[2] * 545 + o

return [x % SIZE for x in xs]

class BloomFilter(object):
def __init__(self):
self.bitarray = [0] * SIZE

def add(self, s):
for x in hashes(s):
self.bitarray[x] = 1

def query(self, s):
return all(
self.bitarray[x] == 1
for x in hashes(s))


The function 'hashes' calculates three hash value using three different hash function. Class 'BloomFilter' has a bit-array. For simple implementation, I use a list as a substitute for a bit array. It decleases the bloom filter's merit much so you shouldn't use the code in practical use.

The following list is the result of interactive execution on a python's shell.

>>> bf = BloomFilter()
>>> bf.add("hoge")
>>> bf.query("hoge")
True
>>> bf.query("hoga")
False
>>> bf.add("foo")
>>> bf.add("bar")
>>> bf.add("baz")
>>> bf.query("hoga")
False
>>> bf.query("foo")
True


When a string added, the bloom filter write '1' on the position of bit array corresponds to each hash value. When a string queried, the bloom filter checks each position and if and only if all of them are '1' it says 'True'. You can imagine more strings are added, the bit array has more '1', and bloom filter more possibly says 'True'. The probability can easily estimate. Given m is the size of bit array, k is the number of hash functions and n is the number of strings you added, the probability is (1 - exp(-float(k * n) / m)) ** k. In this case, m equals 1987 and k equals 3, so when n is 100 the prob. is 0.003 and when n is 1000 the prob. is 0.47. You have to increase the size of bit array if you want to add 1000 strings to a bloom filter.

No comments: