Stuffing an elephant into a hat…

Let's imagine you have built a music app like Spotify. You've finally grown this thing to sizeable
amount of users and you have a decent number of titles in your content library. Let's also say this
app has social elements to it so your users can connect with their facebook friends or twitter
followers. Now, let's say each time your users play a song in your app you want to ask the question

? The intention
being that you may recommend that song to them if they haven't listened to it.
*Which of this user's friends have NOT listened to this song yet*

The question is not so simple at scale, because it introduces exponential elements of both time and space
complexity, as the number of users and number of titles available, grow. The biggest worry is how
much memory it will take your API to respond to the question when the user hits the play button.
If the user has thousands of friends and they've each listened to thousands of songs that's millions
of permutations were we to just use a needle/haystack search. The space complexity is also huge!
*So how do we fit potentially millions of elements into memory efficiently and still be able to
query this data quickly?*

One solution to this problem is to use a data structure known as a **bloom filter**.
A bloom filter is basically a very space-efficient hash set with probabilistic tendency. If you
aren't familiar with a hash set or sets in general, let's do a quick review of what they mean.

A set is basically just a collection of unique elements. A hash set is a more space efficient
way of storing a set of elements such that we only store the output of a one-way hash
function of this element. This allows us to ask the question does `X`

belong to the set,
very quickly, and efficiently such that the set query has `O(1)`

time complexity and
the space efficiency is fixed by the hash function.

We already know of a data structure with similar properties in PHP today. It's called the `Array`

.
Since PHP arrays are just ordered *hash maps*, the keys in a PHP array are just a hash set that
*map* to values. So if we removed the map part of that we would have what's known as a
hash set. This means that when we do `$arr["foo"]`

in PHP, PHP doesn't have to iterate
the entire set of keys in the array to find `"foo"`

. Instead, the key `"foo"`

is just hashed and the thing we look for is the hash itself and this gives us an `O(1)`

search for the key.

OK, for those of you who are pedantic you know I'm lying here because technically the implementation of this in PHP is a doubly linked list (and in PHP 7 quite possibly a flattened array in most cases), so that's not entirely accurate, but hey I'm trying to make a relate-able point here!

Great, so now that we have an understanding of what sets, hash sets, and hash maps are we can explore why a bloom filter is an even more efficient representation of a set than a hash map or a hash set.

It's important to note that a bloom filter sacrifices accuracy for space efficiency. This is because the premise of the bloom filter is to compress a hash.

Wait a minute...

What?!

Why would you compress a fixed-size output of an arbitrary-length input?

Well, this is why I liken the concept to that of trying to pull an elephant out of a tiny little
hat. Not only do we shrink the arbitrary length input down to a fixed size hash, but we also try
to shrink the hash down to a single bit.

The idea is that if you use multiple hash functions and store only the boolean interpretation
of the hash's presence in the set (*i.e. either the hash exists or it doesn't exist*),
and not the hash value itself, then what we end up with is an efficient probabilistic data
structure that can ascertain, within a given–probability of error, whether or not
the element exists in the set. This has a unique property, however. In that a bloom filter can
provide false positives, but never false negatives.

In other words, a bloom filter can either tell you

or *this element is probably in the set*

.this element isdefinitely not in the set

Another interesting characteristic of the bloom filter is that it typically only provides a way
for you to `set`

and `check`

elements. In other words, you can't actually
`delete`

elements from the set once they're set. However, there are other ways to achieve this
by using a counting filter, for example. But for now we'll just focus on learning about the
traditional bloom filter since our music app use case clearly doesn't require us to delete things from the set.
Because, you can't exactly *un-listen* to a song once you've listened to it, right?

So to really grok the concept behind bloom filters let's start with a very rudimentary example of how this would work. Let's imagine that we have 8 light bulbs that represent a bloom filter. In their initial state, all of the light bulbs are turned off. Once we add an element to the bloom filter we will turn one or more of these light bulbs on according to our hash functions.

Let's say our bloom filter has just 2 hash functions to start. Let's call them `h1()`

and `h2()`

.
So for every element we add to the bloom filter we run our input through each of these hash
functions and what we get back is a number between 1 and 8. We simply find the position of the
light bulb using that number and then turn it on. If it's already on we just leave it on.

So let's say the input `"A"`

to the hash function `h1()`

gives us `1`

.
Then `h2("A")`

gives us `7`

. So let's turn on light bulbs 1 and 7.

Now if we want to check whether or not `"A"`

exists in our bloom filter we simply
run the input back through our hash functions and ask are these light bulbs all turned on

?
If even one of them is not on then the element definitely does not exist in the set.

However, do note that different inputs could hash to the same positions. So just because all of
the light bulbs are turned on doesn't necessarily mean the element is definitely in the set. It
just means it's probably in the set. For example, let's say we run `"B"`

through our
hash functions and arrive at `4`

and `1`

. Light bulb `1`

is
already turned on. So we leave it on and turn on light bulb `4`

.

Now let's suppose we ask the question is `"C"`

in the set? If `"C"`

hashes
to `4`

and `7`

the answer will be *yes*, the element probably exists.
However, that's only a result of both `"A"`

and `"B"`

turning on light bulbs
`4`

and `7`

even though we never added `"C"`

to the set.

This is what makes a bloom filter probabilistic. Because we can actually calculate to a certain
probability how likely it is for us to get a false positive. Given that we know the size of the
bloom filter and the number of hash functions used, we can calculate the probability that
a light bulb is not turned on by any of the hash functions (*and assuming all the hash functions
have even distribution over the light bulbs*) as `(1 - 1/n)`

where
^{k}`n`

is the number of light bulbs in the bloom filter and `k`

is the number of
hash functions used. To get the probability that a membership query will return a false positive
in the set we need to also consider the number of elements already inserted into the set along
with the bloom filter's size and number of hash functions used. We can calculate this as
`(1 - (1 - 1/n)`

where ^{(k * j)})^{k}`j`

is the number
of elements already inserted into the bloom filter.

OK, because I know the mathematicians will hound me over this one, the above formula is only an approximation of the probability of false positives. It's not entirely accurate because it assumes that disjointed probability. Since independent events don't necessarily conclude us reaching a collision in this scenario (as demonstrated above by our earlier example). However, it's closeenoughfor our purposes.

Meaning that in our example here, there is a ~17% chance that we will have a false positive after
two elements are inserted into our bloom filter or a ~30% chance of a false positive after three
elements are inserted into our bloom filter.
Which is obviously pretty high, but you could lower that probability by increasing the size of
the bloom filter (`n`

) and the number of hash functions (`k`

) used.

To see how this applies in practice, let's turn our light bulbs into bits. If we consider that
a bit (*which represents either a 1 or 0, i.e. either an on or an off state*) tells us
that a hash is either set or not set in the bloom filter, we can get a binary representation of the hash.
In PHP this can be achieved easily by using a string. Since strings in PHP are just 256 bit ordered
bytes (*i.e. 8 bits per character*). This gives a relatively easy way to initialize a
bloom filter of any size (*padded to 8 bits*) by generating a string of `ceil($n / 8)`

bytes, where `$n`

represents the number of bits in the bloom filter.

To calculate how many bits we will need we need to know in advance how many elements we plan on storing in the bloom filter. Since the bloom filter is already pretty cheap in terms of memory we can easily over-estimate this number without much penalty. But how do we know how many bits to use for a given number of elements?

To do this accurately we first need to consider the desired probability of false positives in our
bloom filter. For example, let's say we want to stick with a `1%`

probability of false positives
(*represented by 0.01 as a decimal or $percent / 100*).
We can use the formula

`$m = (($n * log($p)) / (log(2) ** 2)) * -1`

where `$n`

is the number of elements we want to be able to store in the bloom filter and `$p`

is
the probability of false positives in the bloom filter. We use the natural logarithm function as
`log(2, M_E)`

where `M_E`

is the irrational constant `2.718281828`

used as the natural base of the logarithm.
So a bloom filter that will store say 100 elements with a 1% probability of false positives will
require `958.50583773674`

bits, or rounded up to a whole number as `959`

bits.
To represent this as a string in PHP padded to the nearest octet we say `str_repeat("\0", ceil($m / 8))`

such that what we end up with is a string of 120 null bytes (*all the bits are off*).
Enough space to fit our 100 elements in the bloom filter with only a 1% probability of false
positives occurring throughout.

Now the question is how do we figure out how many hash functions we will need for this optimal
1% probability of false positives occurring. To do this we take the size of the bloom filter in bits
and the number of elements and plug them into the formula `$k = $m / $n * log(2)`

where
`$m`

is the size of the bloom filter in bits (*which we just came up with above*).
Now `$k`

represents our optimal number of hash functions needed to maintain a 1% probability
of false positives. So this means we will need `6.6438561897747`

hash functions, or
`7`

hash functions rounded up to the nearest whole number.

Finally, we take our null-initialized string and simply flip the respective bits, indicated by the hashes produced from our hash functions, and we have a bloom filter! Easy as pie, right?

Well, not quite....

Now comes the really fun part where we get to do some neat arithmetic!

First, you need to figure out a way to size the output from the hash functions to fit our
bloom filter. Because the intention is to always get a number from the hash function between
0 and 958 (*as indicated by the bloom filter's number of available bits*), we use
clock-arithmetic to do this.

Take, for example, the output from `md5("Hello PHP", true)`

as an arbitrarily sized binary
number. If we simply unpack this number into an integer space and use a modulus of `$m`

what we end up with is a number between 0 and 958, which can represent the offset of the bit in
our bloom filter that needs to be flipped on, or checked. So `unpack("J", md5("Hello PHP", true))`

would give us the number `14213587061305870973`

represented as an unsigned long long (64 bit
big endian first) integer. If we do `14213587061305870973 % 959`

we should get `203`

.
So we simply have to find the bit at offset `203`

in our bloom filter and turn it on.
We can get the byte offset if we divide `203 / 8`

, which gives us `$bloomfilter[25]`

,
or the byte at offset 25 of our bloom filter string. Then if we take the remainder of that division
`203 % 8`

we know to flip the 3rd bit at that offset. So `ord($bloomfilter[25]) | (2 ** 3)`

.

Of course, the wise folk of PHP know that you can't get an unsigned integer representation in a native PHP integer data type. As such my example above of unpacking the

`md5`

hash into an integer space, when observed though a PHP integer type, provides a negative value. You can see that it's still correct when represented as an unsigned value, however, with something like`sprintf("%u", ...unpack("J", md5("Hello PHP", true)))`

which would give you the number above instead of the`-4233157012403680643`

you might be getting with something like`echo`

or`var_dump()`

.Additionally, if you understand this you understand why using the modulus operator on such large numbers might be giving you very strange results as well. As such it is recommended that you use

`GMP`

numbers instead as PHP 5.5 and newer will automatically handle the operator overloading as you attempt to do arithmetic on such large values that need not be represented as signed integers.

If you come from C, you'll notice that in PHP even though a string is technically a `char`

underneath the hood, we still need to tell PHP to obtain the ordinal number of the character
in order to be able to do bitwise arithmetic on it.

Checking the hash is just the inverse process. We ask `ord($bloomfilter[25]) & (2 ** 3)`

,
using a bitwise `AND`

instead of a bitwise `OR`

on the byte.

You can see a full working example of the code explained above in this gist to get a better idea of how you'd go about implementing something like this.

` ````
<?php
/**
* This class relies on the GMP extension to be loaded in PHP
*
*/
class Bloomfilter
{
protected $maxElements, $probability, $space, $hashes, $filter;
public function __construct($numElements, $probability = 0.01)
{
$this->maxElements = $numElements;
$this->probability = $probability;
}
protected function calculateSpace($numElements, $probability)
{
return (int) ceil(($numElements * (log($probability)) / (log(2) ** 2)) * -1);
}
protected function calculateHashFunctions($numElements, $space)
{
return (int) ceil($this->space / $this->maxElements * log(2));
}
protected function numHashFunctionsAvailable()
{
$num = 0;
foreach(hash_algos() as $algo) {
$num += count(unpack('J*', hash($algo, 'bloom', true)));
}
return $num;
}
protected function hash($element)
{
$hashes = [];
foreach(hash_algos() as $algo) {
foreach(unpack('P*', hash($algo, $element, true)) as $hash) {
$hash = gmp_init(sprintf("%u", $hash));
$hashes[] = ($hash % $this->space);
if (count($hashes) >= $this->hashes) {
break 2;
}
}
}
return $hashes;
}
protected function isInit()
{
if (!strlen($this->filter)) {
throw new Exception("Filter has not been initialized!");
}
}
public function initialize()
{
$this->space = $this->calculateSpace($this->maxElements, $this->probability);
$this->hashes = $this->calculateHashFunctions($this->maxElements, $this->space);
if ($this->hashes > $this->numHashFunctionsAvailable()) {
throw new Exception("Can't initialize filter with available hash functions");
}
$this->filter = str_repeat("\0", ceil($this->space / 8));
}
public function set($element)
{
$this->isInit();
if (!is_scalar($element)) {
$element = serialize($element);
}
$hashes = $this->hash($element);
foreach($hashes as $hash) {
$offset = (int) floor($hash / 8);
$bit = (int) ($hash % 8);
$this->filter[$offset] = chr(ord($this->filter[$offset]) | (2 ** $bit));
}
}
public function check($element)
{
$this->isInit();
if (!is_scalar($element)) {
$element = serialize($element);
}
$hashes = $this->hash($element);
foreach($hashes as $hash) {
$offset = (int) floor($hash / 8);
$bit = (int) ($hash % 8);
if (!(ord($this->filter[$offset]) & (2 ** $bit))) {
return false;
}
}
return true;
}
}
```

So now that we know how to produce the bloom filter, set values, and check values in it; we can
evaluate the correctness of our solution for the use case stated at the beginning of this article.
Given that a user has 100 friends on our music app, and assuming each friend has listened to roughly
100 songs, we'd need a total of 100 bloom filters with the above parameters (*sized to 959 bits
with a probability of 0.01*). This would consume a total of ~11.72KB in memory to load all
100 bloom filters at once. The time complexity of our query would be `O(k * n)`

, which
means it's still 7 times slower than using an array where the keys represent the songs we're looking
for in each array, since that's `O(1)`

, or constant amortized time.

However, the savings in space here is **huge**! When we consider that 100 arrays
with 100 elements each, would consume roughly 1.37MB in memory, that's several orders of magnitude
more than our bloom filters. This means you could easily store these tiny bloom filters in
memcached or even APC user cache (*for blazing fast memory swaps*) and not even have
to worry as they grow. Since even an order of magnitude increase in both factors would pale by
comparison. For example, storing 10,000 bloom filters with 1,000 elements each and even at a
0.1% probability of false positives would only require a total of ~17.15MB of space, and come at
a time cost of `O(n * 10)`

, vs. using arrays at a whopping ~1.34GB!

Considering that the average bloom filter with roughly 1% false positive rate will consume about
10 **bits** per element, vs. the 144 **bytes** per element in a PHP
array, we can actually fit a lot more elements into memory at once per request and still have a
reasonably reliable query. Sure, we sacrifice a little bit of accuracy and a little bit of speed
to do so, but the upside is a big win at scale if the use case makes sense.

In our case we're not too concerned if we a get a false positive every now and then, because
the intention is to be able to tell

,
rather than, they *this friend definitely did NOT listen to this song*

, which is a lot more expensive to do then to just check for a definiteis it worth going out and recommending this song to your friends