What is a Redis Bloom Filter? (Beginner’s Guide)

Okay, here’s a comprehensive beginner’s guide to Redis Bloom Filters, aiming for approximately 5000 words.

Redis Bloom Filters: A Beginner’s Guide (and Beyond)

Bloom filters are a fascinating data structure, and their integration into Redis makes them incredibly powerful and accessible for a wide range of applications. This guide will take you from the absolute basics of what a Bloom filter is all the way through to practical implementations and considerations within Redis. We’ll cover:

  1. What is a Bloom Filter (Conceptually)?
  2. Why Use a Bloom Filter (Use Cases)?
  3. How Does a Bloom Filter Work (Mechanics)?
  4. Bloom Filter Limitations (False Positives)?
  5. Redis and Bloom Filters: The RedisBloom Module
  6. Installing and Setting up RedisBloom
  7. Basic RedisBloom Commands (with Examples)
  8. Advanced RedisBloom Features (Scaling, Counting Bloom Filters)
  9. Choosing Parameters (Error Rate, Capacity)
  10. Practical Use Cases in Detail (with Code Snippets)
  11. Performance Considerations and Benchmarking
  12. Alternatives to Bloom Filters
  13. Bloom Filters vs. Other Redis Data Structures
  14. Common Mistakes and Pitfalls
  15. Conclusion and Further Resources

Let’s dive in!


1. What is a Bloom Filter (Conceptually)?

Imagine you have a massive collection of items – usernames, IP addresses, product IDs, anything. You frequently need to check if a new item is already in this collection. Traditionally, you might use:

  • A List/Array: Very slow for large datasets (O(n) search time – you potentially have to check every element).
  • A Hash Table/Dictionary: Fast (O(1) on average), but requires storing all the items in memory, which can be expensive for truly massive collections.
  • A Database Query: Relatively fast, but adds latency and load on your database.

A Bloom filter offers a different approach. It’s a probabilistic data structure designed for set membership testing. This means:

  • It tells you if an item is possibly in the set or definitely not in the set. This is the key difference!
  • It’s incredibly space-efficient. It can represent a massive set using a surprisingly small amount of memory.
  • It’s very fast for membership checks.

Think of it like a highly compressed, slightly fuzzy representation of your set. It’s like a “maybe” filter.

The Analogy:

Imagine a giant, opaque box. You can put balls into the box (adding items to the set). You can also shake the box and ask, “Is there a red ball in here?”

  • If the box says “No,” you can be 100% sure there’s no red ball.
  • If the box says “Yes,” there might be a red ball, but there’s also a chance the box is wrong (a false positive). Maybe it’s confusing a blue ball for a red one in its compressed representation.

The Bloom filter is that “box.” It trades perfect accuracy for extreme space and speed efficiency.


2. Why Use a Bloom Filter (Use Cases)?

Bloom filters are incredibly versatile. Here are some common scenarios where they shine:

  • Preventing Duplicate Content Submission: Imagine a social media platform. You want to prevent users from accidentally submitting the same post multiple times. A Bloom filter can quickly check if a post’s hash (a unique fingerprint) has already been seen.

  • Reducing Database Load: If you have a frequently accessed, but rarely updated, dataset (e.g., a list of valid usernames), you can use a Bloom filter as a cache pre-check. Before querying the database, check the Bloom filter. If it says “No,” you know the username isn’t valid, saving a database lookup.

  • Spam Filtering: Email services can use Bloom filters to track known spam email addresses or URLs. If a new email contains an item flagged as “possibly spam” by the filter, it can be routed to the spam folder for further inspection.

  • Web Crawling: Web crawlers (like Googlebot) need to avoid revisiting the same URLs repeatedly. A Bloom filter can store a massive set of visited URLs efficiently.

  • Content Recommendation Systems: To avoid recommending items a user has already seen, a Bloom filter can be used to track viewed items.

  • Distributed Caching: In a distributed caching system, a Bloom filter can quickly determine if a particular cache node might contain the desired data, reducing unnecessary network requests.

  • Network Intrusion Detection: Bloom filters can be used to track known malicious IP addresses or network traffic patterns.

  • Spell Checking Dictionaries can be large. If you want to embed it in something like a phone, then size is important. A Bloom filter could be used to find all misspelled words. The accuracy is that if the bloom filter says the word is not in the dictionary, then it is definitely misspelled. But if it says it is in the dictionary, then it is probably correct (but could be a false positive).

The common thread is that these applications benefit from:

  • Fast membership checks: The speed of a Bloom filter is crucial.
  • Space efficiency: The ability to represent a large set with minimal memory is a major advantage.
  • Tolerance for false positives: The applications are designed to handle the occasional false positive without significant consequences.

3. How Does a Bloom Filter Work (Mechanics)?

The magic of a Bloom filter lies in its clever use of hash functions and a bit array. Here’s a breakdown:

  • Bit Array: A Bloom filter starts with a bit array – a sequence of bits, all initially set to 0. Think of it as a long row of light switches, all initially off. The size of this array (the number of bits) is a crucial parameter that affects the filter’s accuracy and capacity.

  • Hash Functions: A Bloom filter uses multiple independent hash functions. A hash function takes an input (e.g., a string) and produces a numerical output (a hash value). Good hash functions distribute their outputs evenly, minimizing collisions (different inputs producing the same output).

  • Adding an Item:

    1. Take the item you want to add (e.g., a username).
    2. Feed the item into each of your hash functions. Each function will produce a different hash value.
    3. Treat each hash value as an index into the bit array.
    4. Set the bit at each of those indices to 1 (turn on the corresponding light switches).
  • Checking for Membership:

    1. Take the item you want to check.
    2. Feed it into the same hash functions you used for adding items.
    3. For each hash value, check the corresponding bit in the bit array.
    4. If any of the bits are 0, the item is definitely not in the set (because if it were, all those bits would have been set to 1).
    5. If all the bits are 1, the item is possibly in the set (but it could be a false positive).

Example:

Let’s say we have a Bloom filter with:

  • A bit array of size 10 (indices 0-9).
  • Two hash functions: hash1 and hash2.

  • Adding “apple”:

    • hash1("apple") = 3
    • hash2("apple") = 7
    • Set bits at indices 3 and 7 to 1. The bit array might look like: 0001000100
  • Adding “banana”:

    • hash1("banana") = 1
    • hash2("banana") = 3
    • Set bits at indices 1 and 3 to 1. The bit array now looks like: 0101000100 (Notice that bit 3 was already set).
  • Checking “apple”:

    • hash1("apple") = 3 (bit is 1)
    • hash2("apple") = 7 (bit is 1)
    • Result: Possibly in the set.
  • Checking “orange”:

    • hash1("orange") = 1 (bit is 1)
    • hash2("orange") = 5 (bit is 0)
    • Result: Definitely not in the set.
  • Checking “grape”:

    • hash1("grape") = 1 (bit is 1)
    • hash2("grape") = 7 (bit is 1)
    • Result: Possibly in the set. But we never added “grape”! This is a false positive.

Key Concepts Illustrated:

  • Multiple Hash Functions: Using multiple hash functions reduces the chance of collisions. If we only used one hash function, “banana” and “apple” might have collided (both hashing to 3), increasing the likelihood of false positives.
  • Bit Overlap: Different items can set the same bits. This is why false positives are possible.
  • No Deletion: You can’t reliably remove an item from a standard Bloom filter. If you try to reset bits to 0, you might inadvertently remove other items that also hashed to those bits. (There are variations like Counting Bloom Filters that address this, which we’ll discuss later.)

4. Bloom Filter Limitations (False Positives)?

The crucial trade-off of a Bloom filter is that it can produce false positives. It can tell you an item is possibly in the set when it actually isn’t. However, it will never produce a false negative (it will never say an item is not in the set when it actually is).

Understanding False Positives:

A false positive occurs when the hash functions for a new item happen to map to bits that were already set to 1 by other items. The more items you add to the Bloom filter, and the smaller the bit array, the higher the probability of false positives.

Factors Affecting False Positive Rate:

  • Size of the Bit Array (m): A larger bit array reduces the chance of collisions, lowering the false positive rate.
  • Number of Hash Functions (k): There’s an optimal number of hash functions. Too few, and you increase collisions. Too many, and you quickly fill up the bit array.
  • Number of Items Added (n): As you add more items, the bit array becomes more saturated, increasing the false positive rate.

The Formula:

The probability of a false positive (p) can be approximated by the following formula:

p ≈ (1 - e^(-kn/m))^k

Where:

  • p = Probability of a false positive
  • m = Size of the bit array (number of bits)
  • n = Number of items inserted
  • k = Number of hash functions
  • e = The base of the natural logarithm (approximately 2.71828)

This formula is an approximation, but it’s useful for understanding the relationships between the parameters.

Optimal Number of Hash Functions:

The optimal number of hash functions (k) that minimizes the false positive rate for a given m and n is:

k ≈ (m/n) * ln(2)

Practical Implications:

  • You need to choose your parameters carefully based on your expected number of items and your acceptable false positive rate.
  • You can calculate the expected false positive rate for a given configuration.
  • You often design for a specific false positive rate (e.g., 1%, 0.1%) and then calculate the required bit array size.

5. Redis and Bloom Filters: The RedisBloom Module

Redis, the popular in-memory data structure store, doesn’t natively include Bloom filters as a built-in data type. However, it’s incredibly extensible through modules. The RedisBloom module provides a powerful and efficient implementation of Bloom filters (and other probabilistic data structures).

Why RedisBloom?

  • Performance: RedisBloom is written in C and highly optimized for speed and memory efficiency. It leverages Redis’s underlying architecture for fast data access.
  • Ease of Use: It provides a simple and intuitive API for working with Bloom filters, integrating seamlessly with Redis commands.
  • Scalability: RedisBloom supports features like scaling Bloom filters, allowing you to handle growing datasets.
  • Additional Probabilistic Data Structures: Besides Bloom filters, RedisBloom also includes implementations of other useful probabilistic data structures like Cuckoo filters, Count-Min Sketch, and Top-K.

6. Installing and Setting up RedisBloom

There are several ways to get RedisBloom up and running:

  • Redis Stack (Recommended): The easiest way is to use Redis Stack, which bundles RedisBloom and other useful modules. You can download Redis Stack from the official Redis website or use Docker:

    bash
    docker run -p 6379:6379 redis/redis-stack:latest

  • Building from Source: You can download the RedisBloom source code from GitHub and compile it yourself. This gives you the most control but requires more technical expertise. Instructions are available on the RedisBloom GitHub repository.

  • Using a Package Manager: Some Linux distributions may have RedisBloom available through their package managers (e.g., apt, yum). However, the availability and version may vary.

  • Cloud Providers: Many cloud providers offer managed Redis instances that include RedisBloom as an option. This is often the simplest approach for production deployments. (e.g., Redis Enterprise Cloud, AWS MemoryDB for Redis, Azure Cache for Redis Enterprise).

Verifying Installation:

Once you have RedisBloom installed, you can verify it by connecting to your Redis instance (using redis-cli or a client library) and running the MODULE LIST command:

redis-cli
127.0.0.1:6379> MODULE LIST
1) 1) "name"
2) "bf"
3) "ver"
4) (integer) 20400

If you see “bf” (the short name for RedisBloom) in the output, the module is loaded correctly.


7. Basic RedisBloom Commands (with Examples)

RedisBloom provides a set of commands for interacting with Bloom filters. Here are the most fundamental ones:

  • BF.RESERVE {key} {error_rate} {capacity}: Creates a new Bloom filter.

    • {key}: The name of the key to store the Bloom filter under.
    • {error_rate}: The desired false positive probability (e.g., 0.01 for 1%).
    • {capacity}: The estimated number of items you’ll insert.
  • BF.ADD {key} {item}: Adds an item to the Bloom filter.

  • BF.MADD {key} {item} [item ...]: Adds multiple items to the Bloom filter.

  • BF.EXISTS {key} {item}: Checks if an item possibly exists in the Bloom filter. Returns 1 if it possibly exists, 0 if it definitely does not.

  • BF.MEXISTS {key} {item} [item ...]: Checks if multiple items possibly exist in the Bloom filter. Returns an array of 1s and 0s, corresponding to each item.

  • BF.INFO {key}: Returns information about the Bloom filter, including its capacity, size, number of hash functions, and number of items inserted.

Examples (using redis-cli):

“`

Create a Bloom filter named ‘users’ with a 1% error rate and capacity for 10000 items.

127.0.0.1:6379> BF.RESERVE users 0.01 10000
OK

Add some usernames.

127.0.0.1:6379> BF.ADD users alice
(integer) 1
127.0.0.1:6379> BF.ADD users bob
(integer) 1
127.0.0.1:6379> BF.MADD users charlie dave eve
1) (integer) 1
2) (integer) 1
3) (integer) 1

Check if usernames exist.

127.0.0.1:6379> BF.EXISTS users alice
(integer) 1
127.0.0.1:6379> BF.EXISTS users frank
(integer) 0
127.0.0.1:6379> BF.MEXISTS users bob zoe frank
1) (integer) 1
2) (integer) 0
3) (integer) 0

Get information about the Bloom filter.

127.0.0.1:6379> BF.INFO users
1) “Capacity”
2) (integer) 10000
3) “Size”
4) (integer) 143776
5) “Number of filters”
6) (integer) 1
7) “Number of items inserted”
8) (integer) 5
9) “Expansion rate”
10) (integer) 2
11) “error_rate”
12) “0.01”
“`

Python Example (using redis library):

“`python
import redis

Connect to Redis (assuming Redis Stack is running on default port).

r = redis.Redis(host=’localhost’, port=6379)

Create a Bloom filter.

r.execute_command(‘BF.RESERVE’, ‘users’, 0.01, 10000)

Add items.

r.execute_command(‘BF.ADD’, ‘users’, ‘alice’)
r.execute_command(‘BF.MADD’, ‘users’, ‘bob’, ‘charlie’)

Check for existence.

print(r.execute_command(‘BF.EXISTS’, ‘users’, ‘alice’)) # Output: 1
print(r.execute_command(‘BF.EXISTS’, ‘users’, ‘frank’)) # Output: 0

Get info

print(r.execute_command(‘BF.INFO’, ‘users’))

Clean up example

r.delete(‘users’)
“`

These examples demonstrate the core functionality of creating, adding to, and querying a Bloom filter in Redis.


8. Advanced RedisBloom Features (Scaling, Counting Bloom Filters)

RedisBloom goes beyond the basic Bloom filter implementation, offering features for more complex scenarios:

  • Scaling Bloom Filters (BF.RESERVE with expansion):

    The standard Bloom filter has a fixed capacity. If you insert more items than the initial capacity, the false positive rate will increase beyond your desired threshold. RedisBloom addresses this with scaling Bloom filters.

    You can specify an expansion parameter when creating the filter (using BF.RESERVE). When the filter reaches its capacity, RedisBloom automatically creates a new, larger Bloom filter and adds it to the chain. Subsequent BF.ADD operations will add items to the appropriate filter in the chain. BF.EXISTS will automatically check all filters in the chain. This maintains the desired error rate even as the number of items grows.

    127.0.0.1:6379> BF.RESERVE scalable_filter 0.01 1000 expansion 2
    OK

    In this example, when scalable_filter reaches 1000 items, a new filter with a capacity of 2000 (1000 * 2) will be created. The expansion parameter controls the growth factor.

  • Counting Bloom Filters (CF.RESERVE, CF.ADD, CF.COUNT, CF.DEL):

    Standard Bloom filters only allow you to add items and check for their possible existence. You can’t remove items or count how many times an item has been added. Counting Bloom filters address this.

    Instead of a simple bit array, a Counting Bloom filter uses an array of counters. Each counter is incremented when an item is added and decremented when an item is (supposedly) deleted.

    • CF.RESERVE {key} {capacity}: Creates a new Counting Bloom filter. (Note: Error rate is not a parameter here; it’s implicitly determined by the counter size and number of hash functions.)
    • CF.ADD {key} {item}: Increments the counters associated with the item.
    • CF.COUNT {key} {item}: Returns the (approximate) number of times the item has been added.
    • CF.DEL {key} {item}: Decrements the counters associated with the item.

    127.0.0.1:6379> CF.RESERVE counting_filter 1000
    OK
    127.0.0.1:6379> CF.ADD counting_filter apple
    (integer) 1
    127.0.0.1:6379> CF.ADD counting_filter apple
    (integer) 2
    127.0.0.1:6379> CF.ADD counting_filter banana
    (integer) 1
    127.0.0.1:6379> CF.COUNT counting_filter apple
    (integer) 2
    127.0.0.1:6379> CF.DEL counting_filter apple
    (integer) 1
    127.0.0.1:6379> CF.COUNT counting_filter apple
    (integer) 1

    Caveats of Counting Bloom Filters:

    • Counter Overflow: The counters have a limited size. If you add an item too many times, the counter can overflow, leading to incorrect counts.
    • Deletions Can Introduce Errors: If you delete an item that wasn’t actually added, you can decrement counters that were set by other items, leading to false negatives (underreporting the count of other items).
    • Space Overhead: Counting Bloom filters require more space than standard Bloom filters because they store counters instead of single bits.

    Counting Bloom filters are useful when you need approximate counts and the ability to (imperfectly) remove items.

  • Cuckoo Filters
    RedisBloom also offers Cuckoo Filters which, like Bloom Filters, provide a probabilistic data structure for checking set membership. However, they support deletion of items, something not possible with a standard bloom filter, and, for applications that store many items and have a low error tolerance, often use less space than Bloom filters.

    *   **`CF.RESERVE {key} {capacity} [bucket_size] [max_iterations] [expansion]`** - Create an empty cuckoo filter with a single sub-filter for the specified number of `items`.
    *   **`CF.ADD {key} {item}`** - Add an item to the cuckoo filter.
    *   **`CF.ADDNX {key} {item}`** - Add an item to a cuckoo filter if the item did not exist previously.
    *   **`CF.INSERT {key} [CAPACITY {capacity}] [NOCREATE] ITEMS {item} ...`** - Adds one or more items to a cuckoo filter. Creates the filter if it does not exist.
    *   **`CF.INSERTNX {key} [CAPACITY {capacity}] [NOCREATE] ITEMS {item} ...`** - Adds one or more items to a cuckoo filter, by default creating it automatically if it does not yet exist.
    *   **`CF.EXISTS {key} {item}`** - Check if an item exists in a Cuckoo Filter
    *   **`CF.DEL {key} {item}`** - Deletes an item once from the filter.
    *   **`CF.COUNT {key} {item}`** - Return the number of times an item may be in the filter.
    *   **`CF.INFO {key}`** - Return information about a key
    

9. Choosing Parameters (Error Rate, Capacity)

Selecting the right parameters for your Bloom filter is critical for achieving the desired balance between accuracy (false positive rate) and space efficiency. Here’s a step-by-step approach:

  1. Determine Your Acceptable False Positive Rate (p): This depends entirely on your application. For a cache pre-check, a 1% false positive rate might be acceptable. For spam filtering, you might want a much lower rate (e.g., 0.1% or even 0.01%).

  2. Estimate the Number of Items You’ll Insert (n): This is your expected capacity. If you anticipate significant growth, consider using a scaling Bloom filter.

  3. Calculate the Required Bit Array Size (m): You can use the formula (or an online Bloom filter calculator) to determine the necessary m to achieve your desired p and n. Rearranging the formula to solve for m, we get:
    m = -(n * ln(p)) / (ln(2)^2)

  4. Calculate the Optimal Number of Hash Functions (k): Use the formula:
    k ≈ (m/n) * ln(2)

  5. Round Up: m should be an integer (you can’t have fractions of bits). Round m up to the nearest whole number. k should also be an integer.

Example:

Let’s say we want:

  • p = 0.01 (1% false positive rate)
  • n = 10000 (expected number of items)

  • Calculate m:
    m = -(10000 * ln(0.01)) / (ln(2)^2) ≈ 95850.5
    Round up to m = 95851 bits.

  • Calculate k:
    k ≈ (95851 / 10000) * ln(2) ≈ 6.64
    Round to k = 7 hash functions.

  • Use these values with BF.RESERVE:
    127.0.0.1:6379> BF.RESERVE myfilter 0.01 10000
    OK

    RedisBloom will automatically choose a suitable number of hash functions based on the provided error_rate and capacity. You can use BF.INFO to confirm the actual parameters used.

Online Calculators:

Several online Bloom filter calculators can simplify this process. They allow you to input your desired parameters and will calculate the required bit array size, the optimal number of hash functions, and the expected false positive rate. A popular one is:

Important Considerations:

  • Memory Usage: The bit array size (m) directly determines the memory used by the Bloom filter. Keep this in mind, especially if you’re working with limited memory resources.
  • Performance: More hash functions (k) mean more computation per operation, which can slightly impact performance. The optimal k provides the best balance.
  • Scaling: If you’re unsure about the future growth of your dataset, use a scaling Bloom filter (expansion parameter) to avoid having to re-create the filter with different parameters later.

10. Practical Use Cases in Detail (with Code Snippets)

Let’s explore some concrete examples of how to use Bloom filters in Redis for different scenarios.

A. Preventing Duplicate Content Submission (Social Media):

“`python
import redis
import hashlib

r = redis.Redis(host=’localhost’, port=6379)

def post_exists(post_content):
“””Checks if a post with similar content likely already exists.”””
# Create a hash of the post content (use a strong hash function like SHA-256).
post_hash = hashlib.sha256(post_content.encode(‘utf-8’)).hexdigest()
# Check if the hash exists in the Bloom filter.
return r.execute_command(‘BF.EXISTS’, ‘post_hashes’, post_hash)

def add_post(post_content):
“””Adds a post’s hash to the Bloom filter.”””
post_hash = hashlib.sha256(post_content.encode(‘utf-8’)).hexdigest()
r.execute_command(‘BF.ADD’, ‘post_hashes’, post_hash)

Initialize the Bloom filter (do this once, e.g., on application startup).

r.execute_command(‘BF.RESERVE’, ‘post_hashes’, 0.001, 1000000) # 0.1% error, 1M capacity

Example usage:

new_post = “This is a new and exciting post!”

if post_exists(new_post):
print(“This post might be a duplicate. Please review.”)
else:
add_post(new_post)
print(“Post added successfully.”)
# … proceed with saving the post to the database …

clean up example

r.delete(‘post_hashes’)
“`

Explanation:

  1. Hashing: We use SHA-256 to create a unique “fingerprint” of the post content. This ensures that even slight variations in the content will result in different hashes.
  2. Bloom Filter: We use a Bloom filter (post_hashes) to store the hashes of previously submitted posts.
  3. post_exists(): This function checks if a new post’s hash is possibly already in the filter.
  4. add_post(): This function adds a new post’s hash to the filter.
  5. Initialization: We initialize the Bloom filter with a low error rate (0.1%) and a large capacity (1 million posts).
  6. Duplicate Check: Before saving a new post to the database, we check if it’s likely a duplicate. If the Bloom filter returns 1, we can prompt the user to review their post or take other appropriate action.

B. Reducing Database Load (Username Availability):

“`python
import redis

r = redis.Redis(host=’localhost’, port=6379)

def is_username_available(username):
“””Checks if a username is likely available (not already taken).”””
if r.execute_command(‘BF.EXISTS’, ‘registered_usernames’, username):
# Username is possibly taken. Need to confirm with the database.
return check_database_for_username(username) # Placeholder for your DB check
else:
# Username is definitely available.
return True

def register_username(username):
“””Registers a username (adds it to the Bloom filter and database).”””
if is_username_available(username):
r.execute_command(‘BF.ADD’, ‘registered_usernames’, username)
# … proceed with adding the username to the database …
print(f”Username ‘{username}’ registered successfully.”)
return True
else:
print(f”Username ‘{username}’ is already taken.”)
return False

def check_database_for_username(username):
“””Placeholder for your actual database query to confirm username availability.”””
# This should query your database to definitively check if the username exists.
# For this example, we’ll simulate it.
existing_users = [“alice”, “bob”, “charlie”] # Replace with your database query
return username not in existing_users

Initialize the Bloom filter (load existing usernames from the database).

r.execute_command(‘BF.RESERVE’, ‘registered_usernames’, 0.01, 100000) # Example parameters

In a real application, you would load all existing usernames from your database

into the Bloom filter during initialization. For example:

for username in get_all_usernames_from_database():

r.execute_command(‘BF.ADD’, ‘registered_usernames’, username)

Example usage:

print(is_username_available(“david”))
print(is_username_available(“alice”))
register_username(“eve”)
register_username(“alice”)

clean up example

r.delete(‘registered_usernames’)
“`

Explanation:

  1. is_username_available(): This function first checks the Bloom filter. If the filter says “No” (0), we know the username is definitely available, saving a database query. If it says “Yes” (1), we must query the database to confirm, as it could be a false positive.
  2. register_username(): This function registers a new username by adding it to both the Bloom filter and the database (after checking availability).
  3. Initialization: It’s crucial to load all existing usernames from your database into the Bloom filter when your application starts. Otherwise, the filter won’t be accurate.
  4. check_database_for_username(): This is a placeholder for your actual database query. This is the definitive check for username availability.

C. Web Crawler (Avoiding Duplicate URLs):

“`python
import redis
import urllib.parse

r = redis.Redis(host=’localhost’, port=6379)

def have_visited_url(url):
“””Checks if a URL has likely been visited before.”””
# Normalize the URL (remove query parameters, fragments, etc., if desired).
normalized_url = normalize_url(url)
return r.execute_command(‘BF.EXISTS’, ‘visited_urls’, normalized_url)

def mark_url_as_visited(url):
“””Marks a URL as visited by adding it to the Bloom filter.”””
normalized_url = normalize_url(url)
r.execute_command(‘BF.ADD’, ‘visited_urls’, normalized_url)

def normalize_url(url):
“””Normalizes a URL (example: removes query parameters and fragments).”””
parsed_url = urllib.parse.urlparse(url)
# Return only the scheme and netloc (you can customize this).
return f”{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}”

Initialize the Bloom filter.

r

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top