Bloom Filters in PHP

Stuffing an elephant into a hat…

Let's imagine you have built a music app like Spotify. You've finally grown this thing to sizeable amount of users and you have a decent number of titles in your content library. Let's also say this app has social elements to it so your users can connect with their facebook friends or twitter followers. Now, let's say each time your users play a song in your app you want to ask the question Which of this user's friends have NOT listened to this song yet? The intention being that you may recommend that song to them if they haven't listened to it.

The question is not so simple at scale, because it introduces exponential elements of both time and space complexity, as the number of users and number of titles available, grow. The biggest worry is how much memory it will take your API to respond to the question when the user hits the play button. If the user has thousands of friends and they've each listened to thousands of songs that's millions of permutations were we to just use a needle/haystack search. The space complexity is also huge! So how do we fit potentially millions of elements into memory efficiently and still be able to query this data quickly?

One solution to this problem is to use a data structure known as a bloom filter. A bloom filter is basically a very space-efficient hash set with probabilistic tendency. If you aren't familiar with a hash set or sets in general, let's do a quick review of what they mean.

Sets, Hash Sets, and Hash Maps

A set is basically just a collection of unique elements. A hash set is a more space efficient way of storing a set of elements such that we only store the output of a one-way hash function of this element. This allows us to ask the question does X belong to the set, very quickly, and efficiently such that the set query has O(1) time complexity and the space efficiency is fixed by the hash function.

We already know of a data structure with similar properties in PHP today. It's called the Array. Since PHP arrays are just ordered hash maps, the keys in a PHP array are just a hash set that map to values. So if we removed the map part of that we would have what's known as a hash set. This means that when we do $arr["foo"] in PHP, PHP doesn't have to iterate the entire set of keys in the array to find "foo". Instead, the key "foo" is just hashed and the thing we look for is the hash itself and this gives us an O(1) search for the key.

OK, for those of you who are pedantic you know I'm lying here because technically the implementation of this in PHP is a doubly linked list (and in PHP 7 quite possibly a flattened array in most cases), so that's not entirely accurate, but hey I'm trying to make a relate-able point here!

Great, so now that we have an understanding of what sets, hash sets, and hash maps are we can explore why a bloom filter is an even more efficient representation of a set than a hash map or a hash set.

Introduction to bloom filters

It's important to note that a bloom filter sacrifices accuracy for space efficiency. This is because the premise of the bloom filter is to compress a hash.


Wait a minute...
What?!
Why would you compress a fixed-size output of an arbitrary-length input?
Well, this is why I liken the concept to that of trying to pull an elephant out of a tiny little hat. Not only do we shrink the arbitrary length input down to a fixed size hash, but we also try to shrink the hash down to a single bit.


The idea is that if you use multiple hash functions and store only the boolean interpretation of the hash's presence in the set (i.e. either the hash exists or it doesn't exist), and not the hash value itself, then what we end up with is an efficient probabilistic data structure that can ascertain, within a given–probability of error, whether or not the element exists in the set. This has a unique property, however. In that a bloom filter can provide false positives, but never false negatives.

In other words, a bloom filter can either tell you this element is probably in the set or this element is definitely not in the set.

Another interesting characteristic of the bloom filter is that it typically only provides a way for you to set and check elements. In other words, you can't actually delete elements from the set once they're set. However, there are other ways to achieve this by using a counting filter, for example. But for now we'll just focus on learning about the traditional bloom filter since our music app use case clearly doesn't require us to delete things from the set. Because, you can't exactly un-listen to a song once you've listened to it, right?

A contrived example

So to really grok the concept behind bloom filters let's start with a very rudimentary example of how this would work. Let's imagine that we have 8 light bulbs that represent a bloom filter. In their initial state, all of the light bulbs are turned off. Once we add an element to the bloom filter we will turn one or more of these light bulbs on according to our hash functions.

Let's say our bloom filter has just 2 hash functions to start. Let's call them h1() and h2(). So for every element we add to the bloom filter we run our input through each of these hash functions and what we get back is a number between 1 and 8. We simply find the position of the light bulb using that number and then turn it on. If it's already on we just leave it on.

So let's say the input "A" to the hash function h1() gives us 1. Then h2("A") gives us 7. So let's turn on light bulbs 1 and 7.

Now if we want to check whether or not "A" exists in our bloom filter we simply run the input back through our hash functions and ask are these light bulbs all turned on? If even one of them is not on then the element definitely does not exist in the set.

However, do note that different inputs could hash to the same positions. So just because all of the light bulbs are turned on doesn't necessarily mean the element is definitely in the set. It just means it's probably in the set. For example, let's say we run "B" through our hash functions and arrive at 4 and 1. Light bulb 1 is already turned on. So we leave it on and turn on light bulb 4.

Now let's suppose we ask the question is "C" in the set? If "C" hashes to 4 and 7 the answer will be yes, the element probably exists. However, that's only a result of both "A" and "B" turning on light bulbs 4 and 7 even though we never added "C" to the set.

This is what makes a bloom filter probabilistic. Because we can actually calculate to a certain probability how likely it is for us to get a false positive. Given that we know the size of the bloom filter and the number of hash functions used, we can calculate the probability that a light bulb is not turned on by any of the hash functions (and assuming all the hash functions have even distribution over the light bulbs) as (1 - 1/n)k where n is the number of light bulbs in the bloom filter and k is the number of hash functions used. To get the probability that a membership query will return a false positive in the set we need to also consider the number of elements already inserted into the set along with the bloom filter's size and number of hash functions used. We can calculate this as (1 - (1 - 1/n)(k * j))k where j is the number of elements already inserted into the bloom filter.

OK, because I know the mathematicians will hound me over this one, the above formula is only an approximation of the probability of false positives. It's not entirely accurate because it assumes that disjointed probability. Since independent events don't necessarily conclude us reaching a collision in this scenario (as demonstrated above by our earlier example). However, it's close enough for our purposes.

Meaning that in our example here, there is a ~17% chance that we will have a false positive after two elements are inserted into our bloom filter or a ~30% chance of a false positive after three elements are inserted into our bloom filter. Which is obviously pretty high, but you could lower that probability by increasing the size of the bloom filter (n) and the number of hash functions (k) used.

A more practical example

To see how this applies in practice, let's turn our light bulbs into bits. If we consider that a bit (which represents either a 1 or 0, i.e. either an on or an off state) tells us that a hash is either set or not set in the bloom filter, we can get a binary representation of the hash. In PHP this can be achieved easily by using a string. Since strings in PHP are just 256 bit ordered bytes (i.e. 8 bits per character). This gives a relatively easy way to initialize a bloom filter of any size (padded to 8 bits) by generating a string of ceil($n / 8) bytes, where $n represents the number of bits in the bloom filter.

To calculate how many bits we will need we need to know in advance how many elements we plan on storing in the bloom filter. Since the bloom filter is already pretty cheap in terms of memory we can easily over-estimate this number without much penalty. But how do we know how many bits to use for a given number of elements?

To do this accurately we first need to consider the desired probability of false positives in our bloom filter. For example, let's say we want to stick with a 1% probability of false positives (represented by 0.01 as a decimal or $percent / 100). We can use the formula $m = (($n * log($p)) / (log(2) ** 2)) * -1 where $n is the number of elements we want to be able to store in the bloom filter and $p is the probability of false positives in the bloom filter. We use the natural logarithm function as log(2, M_E) where M_E is the irrational constant 2.718281828 used as the natural base of the logarithm.

So a bloom filter that will store say 100 elements with a 1% probability of false positives will require 958.50583773674 bits, or rounded up to a whole number as 959 bits. To represent this as a string in PHP padded to the nearest octet we say str_repeat("\0", ceil($m / 8)) such that what we end up with is a string of 120 null bytes (all the bits are off). Enough space to fit our 100 elements in the bloom filter with only a 1% probability of false positives occurring throughout.

Now the question is how do we figure out how many hash functions we will need for this optimal 1% probability of false positives occurring. To do this we take the size of the bloom filter in bits and the number of elements and plug them into the formula $k = $m / $n * log(2) where $m is the size of the bloom filter in bits (which we just came up with above). Now $k represents our optimal number of hash functions needed to maintain a 1% probability of false positives. So this means we will need 6.6438561897747 hash functions, or 7 hash functions rounded up to the nearest whole number.

Finally, we take our null-initialized string and simply flip the respective bits, indicated by the hashes produced from our hash functions, and we have a bloom filter! Easy as pie, right?

Well, not quite....
Now comes the really fun part where we get to do some neat arithmetic!


First, you need to figure out a way to size the output from the hash functions to fit our bloom filter. Because the intention is to always get a number from the hash function between 0 and 958 (as indicated by the bloom filter's number of available bits), we use clock-arithmetic to do this.

Take, for example, the output from md5("Hello PHP", true) as an arbitrarily sized binary number. If we simply unpack this number into an integer space and use a modulus of $m what we end up with is a number between 0 and 958, which can represent the offset of the bit in our bloom filter that needs to be flipped on, or checked. So unpack("J", md5("Hello PHP", true)) would give us the number 14213587061305870973 represented as an unsigned long long (64 bit big endian first) integer. If we do 14213587061305870973 % 959 we should get 203. So we simply have to find the bit at offset 203 in our bloom filter and turn it on. We can get the byte offset if we divide 203 / 8, which gives us $bloomfilter[25], or the byte at offset 25 of our bloom filter string. Then if we take the remainder of that division 203 % 8 we know to flip the 3rd bit at that offset. So ord($bloomfilter[25]) | (2 ** 3).

Of course, the wise folk of PHP know that you can't get an unsigned integer representation in a native PHP integer data type. As such my example above of unpacking the md5 hash into an integer space, when observed though a PHP integer type, provides a negative value. You can see that it's still correct when represented as an unsigned value, however, with something like sprintf("%u", ...unpack("J", md5("Hello PHP", true))) which would give you the number above instead of the -4233157012403680643 you might be getting with something like echo or var_dump().

Additionally, if you understand this you understand why using the modulus operator on such large numbers might be giving you very strange results as well. As such it is recommended that you use GMP numbers instead as PHP 5.5 and newer will automatically handle the operator overloading as you attempt to do arithmetic on such large values that need not be represented as signed integers.

If you come from C, you'll notice that in PHP even though a string is technically a char underneath the hood, we still need to tell PHP to obtain the ordinal number of the character in order to be able to do bitwise arithmetic on it.

Checking the hash is just the inverse process. We ask ord($bloomfilter[25]) & (2 ** 3), using a bitwise AND instead of a bitwise OR on the byte.


You can see a full working example of the code explained above in this gist to get a better idea of how you'd go about implementing something like this.

                        
<?php
/**
 * This class relies on the GMP extension to be loaded in PHP
 *
 */
class Bloomfilter
{
    protected $maxElements, $probability, $space, $hashes, $filter;
    public function __construct($numElements, $probability = 0.01)
    {
        $this->maxElements = $numElements;
        $this->probability = $probability;
    }

    protected function calculateSpace($numElements, $probability)
    {
        return (int) ceil(($numElements * (log($probability)) / (log(2) ** 2)) * -1);
    }

    protected function calculateHashFunctions($numElements, $space)
    {
        return (int) ceil($this->space / $this->maxElements * log(2));
    }

    protected function numHashFunctionsAvailable()
    {
        $num = 0;
        foreach(hash_algos() as $algo) {
            $num += count(unpack('J*', hash($algo, 'bloom', true)));
        }
        return $num;
    }

    protected function hash($element)
    {
        $hashes = [];
        foreach(hash_algos() as $algo) {
            foreach(unpack('P*', hash($algo, $element, true)) as $hash) {
                $hash = gmp_init(sprintf("%u", $hash));
                $hashes[] = ($hash % $this->space);
                if (count($hashes) >= $this->hashes) {
                    break 2;
                }
            }
        }
        return $hashes;
    }

    protected function isInit()
    {
        if (!strlen($this->filter)) {
            throw new Exception("Filter has not been initialized!");
        }
    }

    public function initialize()
    {
        $this->space  = $this->calculateSpace($this->maxElements, $this->probability);
        $this->hashes = $this->calculateHashFunctions($this->maxElements, $this->space);
        if ($this->hashes > $this->numHashFunctionsAvailable()) {
            throw new Exception("Can't initialize filter with available hash functions");
        }
        $this->filter = str_repeat("\0", ceil($this->space / 8));
    }

    public function set($element)
    {
        $this->isInit();
        if (!is_scalar($element)) {
            $element = serialize($element);
        }
        $hashes = $this->hash($element);
        foreach($hashes as $hash) {
            $offset = (int) floor($hash / 8);
            $bit    = (int) ($hash % 8);
            $this->filter[$offset] = chr(ord($this->filter[$offset]) | (2 ** $bit));
        }
    }

    public function check($element)
    {
        $this->isInit();
        if (!is_scalar($element)) {
            $element = serialize($element);
        }
        $hashes = $this->hash($element);
        foreach($hashes as $hash) {
            $offset = (int) floor($hash / 8);
            $bit    = (int) ($hash % 8);
            if (!(ord($this->filter[$offset]) & (2 ** $bit))) {
                return false;
            }
        }
        return true;
    }

}
                        
                    

BOOM! goes the filter

So now that we know how to produce the bloom filter, set values, and check values in it; we can evaluate the correctness of our solution for the use case stated at the beginning of this article. Given that a user has 100 friends on our music app, and assuming each friend has listened to roughly 100 songs, we'd need a total of 100 bloom filters with the above parameters (sized to 959 bits with a probability of 0.01). This would consume a total of ~11.72KB in memory to load all 100 bloom filters at once. The time complexity of our query would be O(k * n), which means it's still 7 times slower than using an array where the keys represent the songs we're looking for in each array, since that's O(1), or constant amortized time.

However, the savings in space here is huge! When we consider that 100 arrays with 100 elements each, would consume roughly 1.37MB in memory, that's several orders of magnitude more than our bloom filters. This means you could easily store these tiny bloom filters in memcached or even APC user cache (for blazing fast memory swaps) and not even have to worry as they grow. Since even an order of magnitude increase in both factors would pale by comparison. For example, storing 10,000 bloom filters with 1,000 elements each and even at a 0.1% probability of false positives would only require a total of ~17.15MB of space, and come at a time cost of O(n * 10), vs. using arrays at a whopping ~1.34GB!

Considering that the average bloom filter with roughly 1% false positive rate will consume about 10 bits per element, vs. the 144 bytes per element in a PHP array, we can actually fit a lot more elements into memory at once per request and still have a reasonably reliable query. Sure, we sacrifice a little bit of accuracy and a little bit of speed to do so, but the upside is a big win at scale if the use case makes sense.

In our case we're not too concerned if we a get a false positive every now and then, because the intention is to be able to tell this friend definitely did NOT listen to this song, rather than, they definitely did listen to the song. All we care to know is is it worth going out and recommending this song to your friends, which is a lot more expensive to do then to just check for a definite not in set. The worst that could happen here if we're wrong (we get a false positive) is that we don't recommend the song to your friend. But since we can't get a false negative, we can know for sure that making the expensive call is actually worth it when we do get a definitely not in set answer.