Bloom Filters - A Smarter Way to Check If "We've Seen This Before"

Jul 10, 2025•5 min read•By Jinita Patel

When you build systems like signups or search engines, one common question keeps popping up:

"Have we seen this before?"

This is key for:

Checking if an email or username is already registered
Avoiding duplicate URLs in a crawler
Speeding up cache checks

Let’s start by looking at how we’d solve this problem traditionally — and why that doesn’t scale.

Thinking Boy

The Conventional Approach

When you want to check if an email, username, or URL already exists, the first approach is often a hash table.

Example

Store each item in a HashSet or Map
On insert or lookup, hash the item and check if it's present

This works fine at a small scale. But as data grows into millions or billions, challenges appear.

Common Strategies

Hash Tables:

Fast lookup (O(1))
Problem: High memory usage; every key must be stored fully

Sharding:

Distribute data across multiple servers
Problem: Introduces network latency and coordination overhead

Caching:

Use Redis or Memcached for hot keys
Problem:Limited memory; cache evictions make lookups unreliable

Core Issue

All these methods store full keys, making them memory and latency-expensive at scale.

Why Use Bloom Filters?

A Bloom Filter is a space-efficient, probabilistic data structure that answers:

This item is definitely not in the set

OR
It might be in the set

The keyword here is: “might”.

Simple Analogy: The Stamp Card

Imagine a bakery giving stamp cards.

A customer visits → you stamp spots based on their name
They return → you check if those stamps exist

If none match → they’re definitely new
If some match → they might have visited before

That’s how Bloom Filters behave.

How Bloom Filters Work

A Bloom Filter consists of:

A bit array of size m, initialized to 0
k independent hash functions

To add an item

Hash the item with k functions
Get k positions in the bit array
Set all those bits to 1

To check an item

Hash again with the same k functions
If any bit is 0 → item is definitely not in the set
If all bits are 1 → item might be in the set

Bloom Filter

Benefits of Bloom Filters

Memory efficient (no full keys stored)
Fast lookups (bitwise operations)
No false negatives (if it says “not present,” it’s accurate)

Real-World Use Cases

Signup systems (check if username/email is taken)
Web crawling (avoid revisiting URLs)
Databases (Bigtable, HBase skip disk I/O)
CDNs (check key presence)

False Positives vs False Negatives

Type	Can Happen?	Explanation
False Positive	Yes	May say item exists when it doesn’t
False Negative	No	Never misses an item that was actually added

Understanding False Positives in Bloom Filters

False Positive

The diagram above demonstrates how Bloom Filters handle lookups and how false positives can occur:

The keys foo, bar, and apple are added to the set.
Each key is hashed using three different hash functions, and the corresponding bit positions are set to 1 in the bit array.
When we check for the key cat, one of its hashed positions points to a 0, so we know for sure that cat was never added — a definite "not present".
However, the key kite hashes to three positions that are already marked as 1 by other keys.
Since all its bits are 1, the Bloom Filter returns true, even though kite was never added.

This is a false positive — the filter incorrectly thinks kite exists.

Bloom Filters are designed this way:
they never miss an inserted item, but they may mistakenly say something exists due to overlapping bits.

Limitations of Bloom Filters

Limitation	Why It Happens
No Deletion	Can’t undo which item set the bit; removing breaks integrity
No Retrieval	Only stores presence indicators, not data
False Positives	Bit collisions grow as more items are added

Deleting from a Bloom Filter? Use Counting Bloom Filters

How it works:

Replace bit array with a counter array
On insert → increment counters
On delete → decrement counters

Downsides:

More memory
Still allows false positives

What Are Ribbon Filters?

Ribbon Filters, developed by Meta, are a more modern alternative to Bloom Filters.

They use structured hashing to improve accuracy and reduce memory further — especially for static data sets.

Advantages of Ribbon Filters

More compact than Bloom
Lower false positive rate
Tailored for static sets like search indexes or key maps
Efficient with billions of entries

Comparison Table

Feature	Bloom Filter	Counting Bloom	Ribbon Filter
Memory Efficient	Yes	Medium	Very
False Positives Possible	Yes	Yes	Less frequent
False Negatives	Never	Never	Never
Supports Deletion	No	Yes	No (static only)
Stores Items	No	No	No
Performance at Scale	Great	Okay	Excellent

Final Takeaway

When systems need to answer “Have we seen this before?” at massive scale, storing every item isn’t practical. Bloom Filters offer a fast, memory-efficient way to approximate set membership.

They’re not perfect — they can't delete or retrieve — but they’re powerful for “is-it-there?” questions in real-time systems.

Use:

Bloom Filter → for fast, low-memory lookups
Counting Bloom → when deletion is necessary
Ribbon Filter → for compact, static, high-scale filtering

Did You Know?

Bloom filters are used in Google Chrome to check if a website has been visited before, improving browsing speed.