The following source is a simple implementation of a bloom filter to learn algorithm.

SIZE = 1987

def hashes(s):

xs = [0, 0, 0]

for c in s:

o = ord(c)

xs[0] = xs[0] * 137 + o

xs[1] = xs[1] * 69 + o

xs[2] = xs[2] * 545 + o

return [x % SIZE for x in xs]

class BloomFilter(object):

def __init__(self):

self.bitarray = [0] * SIZE

def add(self, s):

for x in hashes(s):

self.bitarray[x] = 1

def query(self, s):

return all(

self.bitarray[x] == 1

for x in hashes(s))

The function 'hashes' calculates three hash value using three different hash function. Class 'BloomFilter' has a bit-array. For simple implementation, I use a list as a substitute for a bit array. It decleases the bloom filter's merit much so you shouldn't use the code in practical use.

The following list is the result of interactive execution on a python's shell.

>>> bf = BloomFilter()

>>> bf.add("hoge")

>>> bf.query("hoge")

True

>>> bf.query("hoga")

False

>>> bf.add("foo")

>>> bf.add("bar")

>>> bf.add("baz")

>>> bf.query("hoga")

False

>>> bf.query("foo")

True

When a string added, the bloom filter write '1' on the position of bit array corresponds to each hash value. When a string queried, the bloom filter checks each position and if and only if all of them are '1' it says 'True'. You can imagine more strings are added, the bit array has more '1', and bloom filter more possibly says 'True'. The probability can easily estimate. Given m is the size of bit array, k is the number of hash functions and n is the number of strings you added, the probability is (1 - exp(-float(k * n) / m)) ** k. In this case, m equals 1987 and k equals 3, so when n is 100 the prob. is 0.003 and when n is 1000 the prob. is 0.47. You have to increase the size of bit array if you want to add 1000 strings to a bloom filter.

## No comments:

Post a Comment