Understanding Bloom filters by building one (ricardoanderegg.com)
90 points by nalgeon on May 29, 2022 | hide | past | favorite | 15 comments



The last time I used Bloom filters was when I had many big files containing DNS queries and responses. Instead of putting the data in a database and creating indexes (the project couldn't justify the cost of having a database), I created a Bloom filter per file. Whenever I needed to search for a query or response, I would quickly know which files I needed to look in.

I think I used this module https://github.com/remram44/python-bloom-filter
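For illustration, here's a minimal sketch of that pattern (my own toy filter class and made-up file contents, not the linked module's API): one small filter per log file, consulted before opening anything.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions derived from SHA-256 (illustrative)."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# One filter per log file: only open files whose filter says "maybe".
logs = {"dns1.log": ["example.com", "a.test"], "dns2.log": ["b.test"]}
filters = {}
for fname, queries in logs.items():
    bf = BloomFilter()
    for q in queries:
        bf.add(q)
    filters[fname] = bf

# Files worth actually searching for "b.test":
candidates = [f for f, bf in filters.items() if "b.test" in bf]
print(candidates)
```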


In my spare time project I use a bloom filter, here: https://github.com/loda-lang/loda-rust/blob/develop/rust_pro...

I use 4 Bloom filters for rejecting programs that aren't present in the OEIS database. The first 10 terms are computed, and the program is rejected if they aren't present in the 1st Bloom filter. Then 10 more terms are computed, and it's rejected if the 20 terms aren't present in the 2nd Bloom filter. And so on, all the way up to 40 terms. If a mined program has 40 possibly-correct terms, there is a chance that it computes the entire OEIS integer sequence correctly.
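A sketch of that staged rejection in Python (the filter class, the two stand-in sequences, and all names are my own illustration, not LODA's actual code):

```python
import hashlib

def bloom_positions(item, size, k=4):
    for i in range(k):
        d = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(d[:8], "big") % size

class Bloom:
    def __init__(self, size=1 << 16):
        self.size, self.bits = size, bytearray(size // 8)
    def add(self, item):
        for p in bloom_positions(item, self.size):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in bloom_positions(item, self.size))

# One filter per prefix length: 10, 20, 30, 40 terms.
PREFIX_LENS = (10, 20, 30, 40)
known_sequences = [list(range(40)), [n * n for n in range(40)]]  # stand-ins for OEIS data

filters = {n: Bloom() for n in PREFIX_LENS}
for seq in known_sequences:
    for n in PREFIX_LENS:
        filters[n].add(",".join(map(str, seq[:n])))

def candidate(terms):
    """Reject as soon as any prefix is unknown; cheap terms are computed first."""
    for n in PREFIX_LENS:
        if ",".join(map(str, terms[:n])) not in filters[n]:
            return False
    return True

print(candidate([n * n for n in range(40)]))   # True: the squares are "known" here
print(candidate(list(range(39)) + [999]))      # rejected at the 40-term stage
```

The staging matters because computing more terms of a mined program costs CPU time; a program that fails the 10-term filter never pays for the remaining 30 terms.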

Shameless self-promotion: if you have CPU resources to spare, then please contribute to LODA, an open source project for mining OEIS integer sequences. https://loda-lang.org/


Google Guava has an industrial-strength implementation, documented here [1], with sources here [2].

Tests and hints on how to drive this: [3]

[1] https://guava.dev/releases/snapshot-jre/api/docs/com/google/...

[2] https://github.com/google/guava/blob/master/guava/src/com/go...

[3] https://github.com/google/guava/blob/master/guava-tests/test...


I've used Bloom filters in a Battlecode [1] AI. We had a shared array where reads and writes are expensive, so a Bloom filter was used to avoid wasting time writing something that was already written there.

[1] https://battlecode.org/


I have wondered if Bloom filters could be used as a cache on hash tables. It really comes down to whether the Bloom filter hashes can be computed significantly faster than doing the hashmap lookup (which includes hashing and probing).


1. Remember that a bloom filter only answers yes/no questions. So it can't cache hash table values, but it can cache set membership (is X in the hash table?).

2. Rather than speed per se, a better reason is space. You have a small Bloom filter that lives close to the query. The query asks "is this entry in the hash table?" and gets a local answer from the Bloom filter, either "no" or "maybe yes". The answer is fast not because the hashing is fast, but because the small size allows the filter to be local and fast. Only for the "maybe yes" answers do you look it up in the hash table, which is large and lives far away. This saves a lot of time if most answers are "no".
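A toy sketch of that arrangement (the dict stands in for the big far-away table; the filter class and sizes are arbitrary illustrations):

```python
import hashlib

class SmallBloom:
    """A few hundred bytes that can stay hot in cache, in front of a big table."""
    def __init__(self, m=4096, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), salt=bytes([i]) * 16).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._idx(key))

table = {"alice": 1, "bob": 2}   # stands in for a big, slow, far-away hash table
bloom = SmallBloom()
for key in table:
    bloom.add(key)

slow_lookups = [0]               # counts how often we actually touch the table

def get(key):
    if not bloom.might_contain(key):
        return None              # fast local "no"; no trip to the table
    slow_lookups[0] += 1         # "maybe yes": pay for the real lookup
    return table.get(key)

get("carol")                     # a miss, almost certainly answered by the filter alone
get("alice")
print(slow_lookups[0])
```

Note that a false positive is harmless here: the query just falls through to the real table, which correctly returns nothing.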


> Remember that a bloom filter only answers yes/no questions.

Well, to nit-pick: the results from querying a Bloom filter are either “no” or “maybe”, where the strength of the maybe varies from “are you feeling lucky?” to “almost definitely” depending on how well it is designed for the data it has been populated with.


the answer is no/maybe, but it's still a yes/no question as typically used


I was about to make a very pedantic point, but thinking about it I'm actually wrong… If the question is phrased oddly then it does give yes/no answers: “Is <item> definitely not in <set>?”.


Doing a hashmap lookup is already very much like using a bloom filter: for any addressing scheme, if an element exists at a particular location, then either the key exists or it's a false positive. If no element exists, then there's no way the key exists. Adding an extra bloom filter would then come down to trading off space usage against a different probability of false positives, in addition to an extra lookup, and I think one would have to contrive an extremely unlikely scenario for it to make sense.


Sometimes "Does anything in this category exist?" is constantly needed and must be as fast as possible - especially if the answer is often "No", while "What is it exactly?" is more rarely needed and it's OK that it's slower.

The Bloom filter gives you quick "No" answers for much less size than a hash table, in exchange for not being able to answer the second question at all.

We built distributed database software, years ago when 40GB was a lot of data, and that used Bloom filters so that if you asked "Show me everything about vaibhavsagar, tialaramex, dang, and celeritascelery" it could do a Bloom filter check, identify that there isn't any data about celeritascelery and dang and immediately cut in half the work to be done, before going to the disk to pull in actual hash tables where records for tialaramex and vaibhavsagar would exist if the positive from the Bloom filter wasn't a false positive.

The UX is nice too because it can make typo results instant. For every municipality in Texas, show me the CesusCount. Instant zero records. Oh, right, not CesusCount, I wanted CensusCount with an N.


that is a very common application of them, afaiu. though, the size and locality of the hashtable plays a significant role here. i may be taking a too generalized view of what you mean by "hash table"


Yes. I forgot where, but I saw one game engine have bitmaps that were exactly 64 bytes as a super quick pre-check that fits into one CPU cache line => much faster than RAM access.
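Something along those lines, sketched in Python (the 64-byte/512-bit size is the only detail taken from the comment; the hashing scheme and keys are made up):

```python
import hashlib

BITS = 512  # 64 bytes = one typical CPU cache line

def positions(key, k=3):
    # Slice k independent 16-bit chunks out of one digest; 9 bits would
    # already cover 512 positions, so 16-bit chunks are plenty.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[2 * i:2 * i + 2], "big") % BITS for i in range(k)]

line = 0  # the whole filter is one 512-bit integer
for key in ("unit_17", "unit_42"):
    for p in positions(key):
        line |= 1 << p

def maybe_present(key):
    return all(line >> p & 1 for p in positions(key))

print(maybe_present("unit_42"))  # True: a Bloom filter never has false negatives
```

In a real engine this would be a fixed 64-byte buffer with much cheaper hashes, but the structure is the same: one cache line answers most "no" questions without touching main memory.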


My first impression is that a better algorithm is obtained with one vector for each hash function, but now I am too lazy to do the math.


it would be a bit more precise of course, but would use three times the memory at any hash size...
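Using the standard false-positive approximations, that trade-off can be put in numbers (a sketch; the m, k, n values are arbitrary):

```python
from math import exp

def fp_shared(m, k, n):
    """Classic approximation: one shared m-bit array, k hash functions, n items."""
    return (1 - exp(-k * n / m)) ** k

def fp_partitioned(m_per_array, k, n):
    """One separate array per hash function, each m_per_array bits."""
    return (1 - exp(-n / m_per_array)) ** k

m, k, n = 1 << 20, 3, 100_000
print(fp_shared(m, k, n))            # one shared array of m bits
print(fp_partitioned(m, k, n))       # k arrays of m bits each: k times the memory, lower FP rate
print(fp_partitioned(m // k, k, n))  # same total memory: roughly the same FP rate
```

So in this approximation, one-array-per-hash only wins when it's allowed the extra memory; at equal total size the two layouts come out roughly even, which matches the parent's point.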



