How to Find the Longest Streak

The Maximum Subarray Problem

If you're into sports this one should be pretty fun for you. Say your favorite sports team just won 3 consecutive games. They're on a winning streak! Given an array of their wins and losses for each game they've played in this seasons, how would you find out if this is their longest streak?

Wait a minute...
What am I doing? I don't like sports. Fuck that shit! Let's get back to some geeky shit us programmers actually care about.

So you're on Github and you look at your stats... It looks like you're on a roll! You just made your one-thousandths commit. You notice Github is keeping track of your longest streak and current streak. Have you ever wondered how you'd calculate the longest streak for similar data sets in your own program? What's the most efficient way of doing so? What would the implementation look like?

It turns out the longest streak is a pretty common problem and not always one implemented efficiently, especially on the web where data is known to grow rampantly. If left unchecked, you could let this problem eat away viciously at your server resources. I've seen a lot of people implement this poorly, taking up to O(n2), where it could actually be done linearly. In this article, I'll attempt to go over an efficient implementation and walk you through the process of calculating the longest streak.

Getting the Deltas

Let's dive right into an example where we have an array, $commits, representing two weeks of commit history. Each element in the array is the total number of commits for the day. We know that the array is sorted by date, in ascending order. So the first element represents Sunday, the second represents Monday, the third represents Tuesday, and so on...

The question is how do we figure out where our longest streak is in this array? Meaing, on which consecutive days during this two week period, did we make the largest number of commits? To answer this question we first have to calculate the deltas — differences — between each successive pair of days across this two week span.

$commits = [4, 3, 5, 7, 12, 15, 9, 11, 8, 2, 0, 0, 1, 0];

function getPairs(Array $array) {
    reset($array);
    do {
        $a1  = current($array);
        $key = key($array);
        $a2  = next($array);
        yield [$a1, $a2];
    } while($key = key($array) !== null);
}

foreach(getPairs($commits) as list($a, $b)) {
    $diff     = $b - $a;
    $deltas[] = $diff;
}

Here we just use a generator as a quick and dirty spliterator way of iterating over successive pairs of elements in the array. Such that we get [ [4, 3], [3, 5], [5, 7], ... ] and so on. Now, calculating the delta of each set of pairs becomes trivial and we end up with an array of 13 deltas as a result. These deltas basically tell us whether we have a positive gain or negative loss for that day based on the previous day. It doesn't account for the first day since we have no data from its prior day. So we'll just ignore it, or assume it's 0.

Next, we need to take these deltas and figure out where our largest positive sum of deltas lay, as one contiguous block, within this array. That is to say, we want the maximum subarray of this array and its indices.

The Maximum Subarray Problem

This brings us to the maximum subarray problem. This is basically where you try to find the largest positive contiguous sum of numbers in an array.

In computer science, the maximum subarray problem is the task of finding the contiguous subarray within a one-dimensional array of numbers (containing at least one positive number) which has the largest sum. For example, for the sequence of values [−2, 1, −3, 4, −1, 2, 1, −5, 4] the contiguous subarray with the largest sum is [4, −1, 2, 1], with sum 6.

But how do we find the maximum subarray efficiently? There is actually a linear time implementation for this known as Kadane's algorithm (found by Jay Kadane of Carneigie-Mellon University). The implementation is pretty simple. You just keep track of two scalar values as you iterate over the array. One, is the current_maximum representing the current maximum sum up to the current iteration. The other, is the maximum representing the maximum sum reached across the set. You take the maximum of 0 or the current maximum plus the given number in the iteration and keep it as your current maximum. Then you take the maximum of the cumulative max or current max and keep that as your max throughout the loop. By the end you should have the largest sum in your maximum.

function maxSubarray(Array $array) {
    $c = $m = 0;

    foreach($array as $int) {
        $c = max(0, $c + $int);
        $m = max($m, $c);
    }

    return $m;
}

As you can see the implementation is quite straight-forward and only requires O(n). Now all we have to do is figure out the elements that added up to this largest sum in order to know on which days we encountered our longest streak.

function maxSubarray(Array $array) {
    $current = $max = $end = 0;
    $start = [0];
    foreach($array as $index => $value) {
        $current = max(0, $value + $current);
        $max = max($current, $max);
        if ($current && $current == $max) {
            $end = $index;
        } elseif (!$current) {
            $start[] = min($index + 1, count($array));
        }
    }
    do {
        $s = array_pop($start);
    } while($start && $end < $s);
    $start = $s;
    return [[$start, $end], $max];
}

By modifying our maxSubarray function just slightly, we are able to keep track of the indexes of those elements in the set of deltas, which make up the largest subarray. This tells us on which days we hit our longest streak. Now the function returns a two-dimensional array, where the first element is an array the first and last offset of elements that represent the start and end of our subarray. The second is the sum of those elements.

So if you do the math here it looks like our longest streak of commits was hitting 46 commits between Sunday and Friday, during our first week. Great, now we have an efficient way of doing this!

Bringing it all Together

Now let's do a little clean up, bring all of our code together, and ruffle the feathers one last time, shall we? We're going to rename our maxSubarray function longestStreak and our getPairs function to getDeltas, since this are just more appropriate names for what they're doing here. Then we're going to call getDeltas from within our longestStreak function in order to only loop over the array once. This requires modifying the getPairs function just slightly by having it return the actual deltas of each pair, rather than the pairs themselves. This also means we want to modify our original maxSubarray function implementation to adapt it to return the correct sum and indices of actual $commits array, rather then the deltas that were calculated from that array. Remember, we're now calculating the deltas as we calculate the maximum subarray for better efficiency.

What we end up with is something that looks like this...

function getDeltas(Array $array) {
    reset($array);
    $i = 0;
    do {
        $a1  = current($array);
        $key = key($array);
        $a2  = next($array);
        yield $i => $a2 - $a1;
        $i++;
    } while($key = key($array) !== null);
}

function longestStreak(Array $array) {
    $current = $max = $end = 0;
    $start = [0];
    foreach(getDeltas($array) as $index => $value) {
        $current = max(0, $value + $current);
        $max = max($current, $max);
        if ($current && $current == $max) {
            $end = $index;
        } elseif (!$current) {
            $start[] = min($index + 1, count($array));
        }
    }
    do {
        $s = array_pop($start);
    } while($start && $end < $s);
    $start = max(0, $s - 1);
    $end   = min(count($array) - 1, $end + 1);
    return [[$start, $end], array_sum(array_slice($array, $start, $end - $start + 1))];
}

$dates = [
    '6/28/2015',
    '6/29/2015',
    '6/30/2015',
    '7/1/2015',
    '7/2/2015',
    '7/3/2015',
    '7/4/2015',
    '7/5/2015',
    '7/6/2015',
    '7/7/2015',
    '7/8/2015',
    '7/9/2015',
    '7/10/2015',
    '7/11/2015',
];

$commits = [4, 3, 5, 7, 12, 15, 9, 11, 8, 2, 0, 0, 1, 0];

$streak = longestStreak($commits);
$start  = $dates[$streak[0][0]];
$end    = $dates[$streak[0][1]];

echo "Longest Streak: {$streak[1]} ($start to $end)";

The output from this code should read...

Longest Streak: 46 (6/28/2015 to 7/3/2015)

Notice, we've also added an array of dates to give a little more context to the example.

What About the Current Streak?

Now you might be wondering, how do I get the current streak?

Given what you've learned about finding the longest streak, I'll leave it as an exercise for the reader to figure out how to arrive at the current streak. Here's your hint: it still requires building an array of deltas from the daily commit totals, but you may want to consider building it backwards.