What are your real-world applications of this versatile data structure?

They are useful for optimization in databases like sqlite and query engines like apache spark. Application developers can use them as concise representations of user data for filtering previously seen items.

The linked site gives a short introduction to bloom filters along with some links to further reading:

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

  • the_sisko
    link
    fedilink
    English
    arrow-up
    7
    ·
    1 year ago

    A classic use for them is spam filtering.

    Suppose you have a set of spam detection systems/rules which are somewhat expensive to execute, eg a ML model or keyword blocklist. Spam tends to come in waves, and frequently it can be as simple as reposting the same message dozens of times.

    Once your systems determine a piece of content is spam (or you manually flag content), it’s a good idea to insert the content into a bloom filter. This means that future posts of the identical content will be flagged without needing to execute the expensive checks, especially if there’s a surge of content stressing your systems.

    Since it’s probabilistic, you can’t use this unless you have some sort of manual reviewing queue or system, as it’s possible for false positives to be flagged. However, you can also run more intensive checks once you’ve flagged content, to detect false positives.

    The false positives can also be a feature, not a bug: with careful choice of hash functions, your bloom filter can actually detect slightly modified content, since most of the hashes may still be the same.

    I’ve worked at companies which use this strategy so it’s very real world.

    • noli@programming.dev
      link
      fedilink
      arrow-up
      1
      ·
      1 year ago

      Cool, so in this case your filter is basically a classifier ML model. How would you set the hash functions then though?

      • the_sisko
        link
        fedilink
        English
        arrow-up
        1
        ·
        1 year ago

        Usually it’s a bunch of different string hashes of the text content. They could be different hashing algorithms, but it’s more common to take a single hash algorithm and simply create a bunch of hash functions that operate on different parts of the data.

        If it’s not text data, there’s a whole bunch of other hashing strategies but I only ever saw bloom filters used with text.

  • noli@programming.dev
    link
    fedilink
    arrow-up
    5
    ·
    edit-2
    1 year ago

    I know they are used in google’s BigTable. All data there is stored in seperate SSTables and you can specify that a locality group should have bloom filters generated for its SSTables. Apparently cassandra has them too.

    Both are the same general application though and you already mentioned databases.

    I did think about using them at some point for authentication purposes in a webservice. The idea being to check for double uses of a refresh token. This way the user database would need to store only a small amount of extra storage to check for the reuse of a refresh token and if you set the parameters accordingly, the false positives are kind of a benefit in that users cannot infinitely refresh and they actually have to reauthenticate sometimes.

    Edit to add: I also read a paper recently that uses a datastructure called a collage that is closely related to bloom filters to perform in-network calculations in a sensor network. If I understand correctly, the basic idea there is that every node in the network adds a bit to the datastructure while it is in transit, so data from the entire network is aggregated. The result can then be fed to a classifier ML model. (Source: Oostvogels, J., Michiels, S., & Hughes, D. (2022). One-Take: Gathering Distributed Sensor Data Through Dominant Symbols for Fast Classification. )

    • Reader9@programming.devOP
      link
      fedilink
      English
      arrow-up
      2
      ·
      1 year ago

      Collage sounds really interesting , will check it out. Another variation on bloom filter I recently learned about is count-min-sketch. It allows for storing/incrementing a count along with each key, and can answer “probably in set with count greater than _”, “definitely not in set”.

      Thanks for adding more detail on the DB use-cases!

  • Redkey@programming.dev
    link
    fedilink
    arrow-up
    4
    ·
    1 year ago

    I learned about Bloom filters from an article discussing how old systems and algorithms shouldn’t be forgotten because you never know when they’ll come in handy for another application. The example they gave was using Bloom filters to reduce data transmission for MMOs; break your world into sectors and just send everyone a Bloom filter of objects mapped to sectors, then the client can request more detail only for objects that are within a certain range of the individual PC.

    • noli@programming.dev
      link
      fedilink
      arrow-up
      2
      ·
      1 year ago

      That’s actually a really nice application, in this case to reduce bandwidth constraints as opposed to the usual use case of memory constraints!

  • Reader9@programming.devOP
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    1 year ago

    I haven’t used them in Spark directly but here’s how they are used for computing sparse joins in a similar data processing framework:

    Let’s say you want to join some data “tables” A and B. When B has many more unique keys than are present in A, computing “A inner join B” would require lots of shuffling if B, including those extra keys.

    Knowing this, you can add a step before the join to compute a bloom filter of the keys in A, then apply the filter to B. Now the join from A to B-filtered only considers relevant keys from B, hopefully now with much less total computation than the original join.