How to Build a URL Shortener like bit.ly or goo.gl

Bijection functions and real-time analytics

URL shorteners like bit.ly and goo.gl have grown quite popular over the years for several reasons. They make it easier to share lengthy URLs over mediums where character count poses a problem, like SMS text messages on mobile devices and tweets with their 140-characters-per-message constraint. They also serve an analytical purpose for tracking click-throughs and other visitor data like client browser software, geo-location information, desktop vs. mobile device, etc… In fact, many of these services are used specifically for marketing purposes. Businesses will use the shortened URL to track campaign engagement through email, other websites, or QR codes distributed through offline media. There's also the spam component, where spammers and other malicious users will use the shortened URL to hide the true destination in emails and websites, with the intent of distributing malware, creating phishing scams, or conducting other malicious activities.

Many people have approached me in the past asking, "How do I build one of these URL shortening services for myself?", because it turns out these services tend to introduce a lot of non-trivial problems once they get up and running. The idea of taking an arbitrary-length input and providing a fixed-length output in return is not a complicated one in computer science. We call that a hash function. You take an arbitrary-length URL like www.phpden.info and turn it into a fixed-length URL like phpden.info/mO192S, and what you have is basically a hash of that URL.

Hash functions have been around for a long time and there's certainly no shortage of hashing algorithms available. So the issue isn't necessarily how do I make the URL shorter, but rather what comes next. However, just to be thorough, let me explain exactly how you would apply the concept of a hash function to a URL, since I see a lot of people implementing this poorly.

If you use a URL shortener like goo.gl, for example, as a non-registered user, your destination URL shares the same short URL as any other non-registered user who supplies the same destination URL. This means if two people both submit the URL www.phpden.info, they should both end up with the same 13-character short URL. So this is very similar to a hash in that the same input will always result in identical outputs, but no two different inputs should result in the same output.

The Birthday Problem

So now we get to the Birthday Problem, or the hash collision problem, that we typically see with any hashing function. Because there are an infinite number of possible inputs, but a very finite number of possible outputs, we're bound to run into collisions: two different inputs resulting in the exact same output. This is only exacerbated by the fact that the short URL consists of only a 6 character hash, which decreases our entropy space significantly. If we use all upper-case and lower-case letters in the English alphabet (a-z and A-Z) and all digit characters (0-9) we get a total of 62 possible characters, which gives us 62⁶, or 56,800,235,584, possible combinations. Now, in all likelihood none of us will ever build our own personalized URL shortening service that will see more than 56 billion unique URLs. Heck, even bit.ly (as of the time of this writing) has less than 28 billion URLs! However, for the sake of argument, let's say we get to or even exceed the scale of bit.ly or goo.gl and try to figure out how we might deal with this problem.

One thing to consider is that the likelihood of hitting your ceiling here is quite low when the destination URLs are unique, because the same destination URL cannot generate more than one short URL, which means we're merely mapping a one-to-one relationship between destination and short URLs. The other thing is that these URLs can expire if they aren't used for a certain length of time. You could also place constraints on how many of them can be generated at a time, or on a per-domain or even per-host basis. An even more desirable option is to use UTF-8 characters to significantly increase your entropy space without necessarily increasing the character count.

Let's pretend we didn't care about the uniqueness of destination URLs, though. We could simply use the same technique we use for password hashing and salt the hash so that two inputs of the same value result in two different hashes. This same salting technique could also be used to mitigate hash collisions where two different inputs end up resulting in the same hash, because it randomizes the entropy even more. So we get to solve two problems with one function here.
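
As a rough, hypothetical sketch of what that salting could look like (the function name and the 8-byte salt size are my own choices, and it anticipates the GMP-based hashing approach shown in the next section):

// A minimal sketch, assuming the GMP extension. Mixing a random salt into the
// input makes identical destination URLs produce different hashes.
function saltedHashURL($url, $len = 6, $base = 62)
{
    $salt = bin2hex(random_bytes(8));               // random per-request salt
    $hash = hash('sha512', $salt . $url, true);     // salt mixed into the input
    return gmp_import($hash) % ($base ** $len);     // reduce into the fixed space
}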

What this doesn't do though is increase your capacity constraint for how many URLs you can store. To increase the capacity we have to increase the length (or fixed size of the hash) or use a larger alphabet.

The Bijection Function

Since it's a one-to-one mapping in the case of non-registered users, we call this a bijection, which just means that every element from one set maps to exactly one element of another set by way of a function, and vice versa.

So your shortened URL always corresponds to exactly one destination URL, and one destination URL always corresponds to exactly one shortened URL. To get there, we first hash the destination URL, which should give us a potentially unique number. Then we take that number and mod it to get a number that fits within our desired fixed length. I'll use SHA-512 for this example, but feel free to choose whichever hashing algorithm you'd like.

function hashURL($url, $len = 6, $base = 62)
{
    // Hash the URL with SHA-512 and keep the raw binary output
    $hash = hash('sha512', $url, true);
    // Interpret the binary digest as a big integer, then reduce it
    // modulo base^len so it fits within our fixed-length space
    $hash = gmp_import($hash) % ($base ** $len);
    return $hash; // a GMP number in the range 0 .. base^len - 1
}

echo hashURL('http://www.phpden.info'); // gives us 44329281076

As you can see, our hash is guaranteed to fit within the range 0 through 56,800,235,583, inclusive, given our desired 6 character length using a base of 62. This means that if we just translate the number provided by our hashURL function from base 10 to base 62, we get a result of mO192S.

function biject($hash, $fromBase = 10, $toBase = 62)
{
    // Accept either a GMP number or a string/integer expressed in $fromBase
    $h = $hash instanceof GMP ? $hash : gmp_init($hash, $fromBase);
    // Render the number in the target base (GMP supports bases up to 62)
    return gmp_strval($h, $toBase);
}

echo biject(hashURL('http://www.phpden.info')); // gives us mO192S

Going the other way, we simply convert the base back from base 62 to base 10.

echo biject("mO192S", 62, 10); // gives us 44329281076

Now we have a two-way bijection function to go back and forth between base 10 and base 62 numbers. What's useful about doing this is that we can simply store the base 10 number in our database as a unique PRIMARY KEY to identify the destination URL entities we store.
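
For example, persisting that mapping could be as simple as the following sketch, building on the hashURL and biject functions above. The short_urls table, its columns, and the PDO connection details are hypothetical, and INSERT IGNORE assumes MySQL:

// A minimal sketch, assuming a hypothetical table:
//   short_urls (id BIGINT UNSIGNED PRIMARY KEY, destination TEXT)
$pdo = new PDO('mysql:host=localhost;dbname=shortener', 'user', 'pass');

function storeURL(PDO $pdo, $url)
{
    $id = gmp_strval(hashURL($url));                // base 10 primary key
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO short_urls (id, destination) VALUES (?, ?)'
    );
    $stmt->execute([$id, $url]);                    // no-op if the URL was seen before
    return biject($id);                             // the short code, e.g. mO192S
}

echo storeURL($pdo, 'http://www.phpden.info');      // gives us mO192S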

If you didn't want to use a hash of the URL, and instead wanted to allow the same destination URL to have multiple short URLs, you could simply pick a random number between 0 and 62⁶ and run it through the biject function to get a random short URL designation. In theory, to guarantee that the short URL generated each time is unique, you might use something like a prefix trie, randomly select nodes in level order, and then delete the deepest consumed nodes. However, this would be a huge space complexity problem, even though it is a very efficient time complexity solution. There are ways to avoid loading the entire prefix trie into memory at once, by using level-order traversal and loading only the nodes from one level at a time using a bit vector.

However, it's actually faster to just ask our database whether the id is already stored in the table (given that we properly indexed the table) and try generating a new random number if the previous one already exists. Given that we're using 64 bit integers, the computation is actually rather efficient, as the DBMS can use a radix trie to do the leg work for us. This works well for as long as you've only used up a tiny bit of your entropy and don't receive too many frequent requests to generate short URLs. It gets rather daunting once you hit a certain limit, however, and implementing the more complicated, but more efficient, prefix trie then becomes the more sought-after solution. Though I won't bother trying to implement such an efficient solution in this article using PHP, since that would probably drive half my readers mad.
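
A bare-bones version of that simpler database-backed approach, reusing the hypothetical short_urls table from the earlier sketch, might look like this (the retry loop is deliberately naive and ignores the race between the check and the insert):

// A minimal sketch: draw random ids until we hit one the database hasn't seen.
function randomShortURL(PDO $pdo, $url, $len = 6, $base = 62)
{
    do {
        $id = random_int(0, ($base ** $len) - 1);   // 0 .. 56,800,235,583
        $stmt = $pdo->prepare('SELECT 1 FROM short_urls WHERE id = ?');
        $stmt->execute([$id]);
        $taken = (bool) $stmt->fetchColumn();       // cheap primary key lookup
    } while ($taken);

    $pdo->prepare('INSERT INTO short_urls (id, destination) VALUES (?, ?)')
        ->execute([$id, $url]);

    return biject($id);                             // note: short codes may be fewer than 6 chars
}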

But just for the sake of knowledge, I will explain the concept behind it…

The Prefix Trie Trick

To be able to efficiently pick a random value without ever repeating the same value twice, we could build a prefix trie of every possible value in our alphabet given our desired entropy space. To zoom in on this idea, let's say our alphabet consists only of 0s and 1s. So we have a maximum of 2 choices at each position with a maximum length of 3 characters. This gives us a total of 2³, or 8, possible combinations.

By starting at the root node and selecting a random child node at each level of the trie, we are guaranteed to get the desired fixed length every time if we simply follow the nodes down to a leaf at the bottom of the trie.

So let's say we happen to select 010 at random, as outlined by the blue highlighted nodes in the illustration above. Once we get to the last leaf node we can simply delete it (pruning any ancestors that are left with no children), thereby removing it entirely from the pool of possible combinations.

Now if we try to randomly select nodes again starting from the root there is no possible way to arrive at the same value twice. No matter which path you take it is not possible to get to the value 010 ever again!
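
Here's a rough sketch of that idea in PHP, using a plain nested-array trie over the tiny binary alphabet from the example. The function names are my own, and building a full trie this way for a 62-character alphabet at depth 6 would be enormous; it's only meant to demonstrate the selection-and-pruning mechanic:

// Build a full prefix trie for the given alphabet and fixed length.
function buildTrie(array $alphabet, $depth)
{
    if ($depth === 0) {
        return [];                                  // leaf marker
    }
    $node = [];
    foreach ($alphabet as $char) {
        $node[$char] = buildTrie($alphabet, $depth - 1);
    }
    return $node;
}

// Walk a random path to a leaf, then prune the consumed branch on the way
// back up so the same value can never be drawn twice.
function drawRandom(array &$node)
{
    if (empty($node)) {
        return '';                                  // pool exhausted (or at a leaf)
    }
    $char = array_rand($node);                      // random child at this level
    $rest = drawRandom($node[$char]);               // descend that branch
    if (empty($node[$char])) {
        unset($node[$char]);                        // prune emptied subtrees
    }
    return $char . $rest;
}

$trie = buildTrie(['0', '1'], 3);                   // 2^3 = 8 possible values
echo drawRandom($trie), "\n";                       // e.g. 010
echo drawRandom($trie), "\n";                       // guaranteed to be different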

This is a much more optimal solution for when the number of used combinations starts to exceed the number of remaining unused combinations in our pool of choices. So in the case of a 62 character alphabet, if we had consumed, say, 50 billion of these possible choices, the probability of selecting a choice at random that has not already been inserted into the database is roughly 1 in 10, which is pretty low. We'd likely have to hit our database about 9 times on average for every newly created short URL. That's probably not so desirable when you reach that scale. The prefix trie trick makes the computation much faster. A radix trie could also be used to make it even more memory efficient, and a bit vector could make it even more memory and CPU efficient without sacrificing accuracy. So while you're definitely not going to hit a wall right away with the first simple, but effective, approach of conferring with your database, I wanted to demonstrate that it is possible to still solve this problem efficiently at scale.

The Analytics Problem

Now in truth, you will likely receive a lot more incoming requests to your short URLs than requests to generate new short URLs. So the more immediate load on your database is going to be in writing analytics data efficiently, not necessarily in generating short URLs. At first glance it would seem like it should be simple: when someone hits your short URL, you look up the id in your database using the biject function we demonstrated earlier to translate the short URL back to a base 10 integer. Then you can insert a new row in your database with the visitor information for analytics purposes, and finally redirect the client to the destination URL.
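
Sketched out as a naive request handler (again using the hypothetical short_urls table and $pdo connection from earlier, plus an equally hypothetical visits table for the raw analytics rows), that flow might look like this:

// Naive redirect handler: look up the destination, log the visit, redirect.
$code = basename($_SERVER['REQUEST_URI']);          // e.g. "mO192S" (ignoring query strings)
$id   = biject($code, 62, 10);                      // back to the base 10 primary key

$stmt = $pdo->prepare('SELECT destination FROM short_urls WHERE id = ?');
$stmt->execute([$id]);
$destination = $stmt->fetchColumn();

if ($destination !== false) {
    $pdo->prepare('INSERT INTO visits (short_url_id, ip, user_agent, visited_at)
                   VALUES (?, ?, ?, NOW())')
        ->execute([$id, $_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'] ?? '']);

    header('Location: ' . $destination, true, 301); // send the client on their way
    exit;
}

http_response_code(404);                            // unknown short code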

The problem begins once you try computing this analytics data for the client. Let's say we get on average 100,000 requests to a given short URL per day. After about a year, computing something as simple as the number of times the URL has been visited, or even worse, the number of unique visitors to the URL would result in a tremendous amount of memory and CPU consumption because we're computing the aggregate results on read. This means every single time the analytics page is requested, we have to build up the whole world from scratch, examine every row in the database that contains the given short URL id, and calculate the aggregate value. That's just bad at almost any scale.

Now, I know some of you might be thinking, oh I'm smart, I have a clever solution to that: I'll just cache the result the first time and no worries! Sure, that works just fine for speeding up subsequent analytics requests, but there are two problems there:

  1. It still takes a long time to serve the request the first time or when the cache misses
  2. The results go stale pretty quickly if the link gets a lot of traffic

The better way to solve this problem is at the write level, not the read level. This is called aggregate-on-write, which is very different from just caching the results of your database queries. It also makes it possible to start doing real-time analytics, since the computation is triggered by the write event rather than the read event.

So a while back I wrote a library called webSAT and put it up on GitHub for just this purpose. The idea is that it's more efficient to aggregate analytical data from your web server at the time it's being written than to try doing so when it comes time to read that data. With something like webSAT, most of your analytics operations would be reduced to something as simple as incrementing an integer, which boils down to basically a single CPU instruction. It's also pretty space efficient, as you don't need to read in an entire database result set. You can rely directly on memcached for the aggregated, de-normalized data and still maintain your database as the source of truth.
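
To illustrate the aggregate-on-write idea without leaning on webSAT's actual API, here's a bare-bones sketch that uses PHP's Memcached extension directly; the key naming convention is just an assumption for this example:

// Every hit to a short URL bumps a per-URL counter at write time, so the
// analytics read becomes a single get. The counters would periodically be
// flushed to the database, which remains the source of truth.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function recordHit(Memcached $memcached, $shortUrlId)
{
    $key = 'hits:' . $shortUrlId;                   // e.g. "hits:44329281076"
    $memcached->add($key, 0);                       // create the counter if missing
    $memcached->increment($key);                    // atomic increment per visit
}

function totalHits(Memcached $memcached, $shortUrlId)
{
    return (int) $memcached->get('hits:' . $shortUrlId);
}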

Now, things like the number of times a URL is visited are pretty simple: they boil down to just incrementing an integer based on the source URI. However, something like the number of unique visitors isn't so trivial. You first have to figure out how you define a unique visitor. One option could be to use a unique cookie, but clients delete cookies all the time, and that could be used to skew our analytics data quite easily. You could use something like the client's remote IP address instead, as a more global identifier that's harder to use to skew the data. It doesn't necessarily identify a unique person, but it does at least give us a close heuristic to work with.

So to do this we have to keep track of every single IP address that hits a short URL. In the IPv4 address space alone there are over 4 billion unique IPs. Multiply that by the 56 billion unique short URLs we can store with a 6 character URL and you're looking at hundreds of millions of trillions of potential data points to track. That's way too big to compute efficiently in real time. So instead we could use something like a bloom filter to check whether the IP already exists in our set before incrementing the aggregate number of unique visitors. If you've read my previous article on Bloom Filters in PHP, you already know this is a much more space efficient data structure than something like a hash table, for example. So it's possible to still do this kind of analytics work in real time pretty efficiently using a data structure like that, wherever the computation requires knowing the uniqueness of a data point within a given data set.
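
As a rough illustration of that last point, here's a toy bloom filter (not the implementation from the Bloom Filters article, just a minimal stand-in) wired into the unique-visitor counter from the previous sketch:

// A toy bloom filter: one bit string, a few independently seeded CRC32 hashes.
class TinyBloom
{
    private $bits;
    private $size;

    public function __construct($sizeInBits = 1 << 23)   // ~1 MiB of bits
    {
        $this->size = $sizeInBits;
        $this->bits = str_repeat("\0", $sizeInBits >> 3);
    }

    private function positions($value)
    {
        foreach (['a:', 'b:', 'c:'] as $seed) {           // derive 3 bit positions
            yield crc32($seed . $value) % $this->size;
        }
    }

    public function mightContain($value)
    {
        foreach ($this->positions($value) as $pos) {
            if ((ord($this->bits[$pos >> 3]) & (1 << ($pos % 8))) === 0) {
                return false;                             // definitely never added
            }
        }
        return true;                                      // probably added before
    }

    public function add($value)
    {
        foreach ($this->positions($value) as $pos) {
            $byte = $pos >> 3;
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos % 8)));
        }
    }
}

// Only bump the unique-visitor counter when the filter has definitely not seen
// this IP for this short URL. A real implementation would persist one filter
// per short URL (e.g. in memcached) rather than rebuilding it per request.
$bloom = new TinyBloom();
$ip = $_SERVER['REMOTE_ADDR'];

if (!$bloom->mightContain($ip)) {
    $bloom->add($ip);
    $memcached->increment('unique:' . $id);               // aggregate-on-write, as above
}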

The Spam Problem

Finally we arrive at the most daunting of all the problems you could encounter in building a URL shortener. The spam! The people who want to abuse your system with ill intent.

Quite honestly, I've dealt with a lot of complicated spam issues throughout my career, and none of them stood out to me more than those caused by the abuse of URL shorteners. The issue you get here is that you not only have to worry about where the destination link sends us, but also what it sends us. For example, let's say I'm a spammer who wants to distribute malware to unsuspecting Internet-goers by using Google's URL shortener to hide my destination URL from the naked eye. If Google catches on, they can simply block my URL from going through their URL shortening service again. However, if I were clever I could just change the URL or have a new URL redirect to the same destination. So now we have to follow the entire HTTP redirect chain. OK, but what if I use JavaScript to do the redirect? Now we have to implement a JavaScript runtime to execute the destination response and make sure it's not doing anything malicious! OK, but what if I use iframes or generate a pop-up window in the browser and put the spam there? Now we're basically just implementing a headless browser.

Of course, Google and bit.ly have the resources to solve these problems at scale and rely on crowd-sourcing to weed out spam, but as you can see, the problem is not trivial to solve once you try to examine all of its depths and breadths. There are a lot of edge cases, and you could end up getting your URL shortener website on some blacklists (since you're effectively colluding with the spammers) if you're not careful.

Do I actually want to explain these depths to you here? Oh fuck that! I could write books about this stuff and still not even scratch the surface!

Conclusion

Don't write your own URL shortener. Go use goo.gl or bit.ly instead.