Going Old-School: Designing Algorithms for Fast Weighted Sampling in Production

Shaked Zychlinski

Shaked Zychlinski

Shaked is an Algorithm Engineer at Taboola, working on Machine Learning applications for Recommendation Systems. He specializes in bringing cookies to coffee breaks.

Shaked Zychlinski | 06 Jun 2019 | Big Data

Tags: algorithms, performance, production, real-time, sampling, uncertainty

If you happen to write code for a living, there’s a pretty good chance you’ve found yourself explaining another interviewer again how to reverse a linked list or how to tell if a string contains only digits. Usually, the necessity of this B.Sc. material ends once a contract is signed, as most of these low-level questions are dealt with for us under-the-hood of modern coding languages and external libraries.

Still, not long ago we found ourselves facing one such question in real-life: find an efficient algorithm for real-time weighted sampling. As naive as it might seem at first sight, we’d like to show you why it’s actually not – and then walk you through how we solved it, just in case you’ll run into something similar. So buckle up, we’ve got some statistics and integrals coming up next!

Why We Need Weighted Sampling in Production?

At Taboola, our core business is to personalize the online advertising experience of millions of users worldwide. Thousands of websites across the globe trust us to display each visiting user the ads that he or she will most likely relate to, and most likely to click and engage with. We do that by training several deep-learning-based models which predict the CTR (click-through rate) of each ad for each user. The process of predicting CTR and displaying the highest rated items is known as Exploitation, as we exploit the model’s predictions. But exploitation is not sufficient for a longterm successful model – we need to allow it to do some Exploration of new possibilities too, in order to find better ads.

One of our ideas for such exploration was as following: ask the model to predict the CTR of a list of ads we would like to display, and then instead of displaying the highest rated items, randomly sample items for that list using weighted sampling. This is sometimes known as Soft-Exploration: the highest rated items are still the most probable ones, but every item has some non-zero probability of being shown.

So, we need to do weighted sampling. The most naive approach to do so will be something like this:

results = empty list
max_r = 1
for i from 0 to amount of given weights:
..r = Random number in the range (0,max_r]
..s = 0
..for each weight in the list of weights:
....s += weight
....if s => r:
......append weight to results
......remove weight from weights
......max_r = max_r - weight
return results

This naive algorithm has a complexity of O(n^2). Considering the fact that the lists are long and all this is happening in real-time, this algorithm is a no-go. But there has to be a better way to do this, right? Well, yes, but we had to design it ourselves.

The Theoretical Solution

Looking hard enough for an algorithm yielded a paper named Weighted Random Sampling by Efraimidis & Spirakis. A single line in this paper gave a simple algorithm to what we should do (page 2, A-Res algorithm, line 2):

1) map each number in the list: w \rightarrow r^{1/w}
..(r is a random number, chosen uniformly and independently for each number)
2) reorder the numbers according to the mapped values
..(first number will be the one which has the largest mapped-value, and so on)

This algorithm involves mapping and sorting, making it  O(n \log n) , way better than O(n^2), but there’s still one issue – the authors never proved it. And since we had no proof this is actually working, we had to prove it ourselves. Brace yourselves, integrals are coming.

Before we begin, as the following lines can be easily confusing, let me briefly describe what I’m about to do:

  1. I will first describe how a weighted-sampling probability-distribution should behave. As this is what we’re eventually looking for, formalizing it mathematically is probably a good idea.
  2. Once we formalized the distribution we want, we will find a specific distribution we can use for weighted sampling.
  3. Lastly, after finding a specific distribution, I’ll link it to the Uniform Distribution, (just like the algorithm above). We’ll be amazed by the fact that the suggested mapping w \rightarrow r^{1/w} really does imitate random sampling.

Ready? Let’s go.

Part I: Formalizing

How does weighted sampling behave? Let’s say we have two numbers, w_1=1 and w_2=2, which we perform weighted sampling over. We’d expect to get the sequence (2,1) two-thirds of the time, and the sequence (1,2) a third of the time. So we expect w_2 to be the first number 66.6% of the times and the second 33.3% of the times. Another way to look at this, is that since we’re sorting the numbers in a list, we’d expect the priority (how close a number is to the head of the list) of w_2 to be the highest two-thirds of the times, and the lowest one-third of the times. You can easily see that priority, which we’ll denote as m, behaves in a way like an inverse-index, meaning the highest m is the first one on the list. We’ll prefer it over the index for two reasons: first, the priority increases as w increases, and it’s more intuitive than the index, which decreases as w increases. Second, the absolute values of the priorities are not relevant; it doesn’t matter if (m_2,m_1) equal to (4.5, 3) or (-1, -5) or (1024, 5). All that matters is the order between them – the highest will be first, then the second-highest and so on. These two characteristics will allow us to generalize better later on. So to wrap this example up, in the case of  w_1=1 and w_2=2, we would like to find a probability distribution which will yield m_1,m_2 which obey:

 m_1, m_2 : \begin{cases} m_2 > m_1,& \text{with probability } p = \frac{2}{3}\\ m_2 < m_1, & \text{with probability } p = \frac{1}{3} \end{cases}

Let’s generalize this and formalize it mathematically: for every two numbers w_1, w_2, we would like to have two random variables M_1, M_2 which originate from a probability distribution D_w (meaning: M_1, M_2$ \sim D_w), where D_w is a probability distribution defined by all w values provided (in this simple example there are only two, w_1 and w_2, but generally there could be more). We expect M_1 < M_2 with probability p = w_2/(w_1 + w_2).

Part II: Finding a Specific Distribution

I claim that the probability distribution defined by the Cumulative Distribution Function (CDF)  F_w(x) = x^w  obeys the requirement above – and I’ll prove it. For this, remember that the Probability Density Function (PDF)  f_w(m)  obeys  f_w(m) = \frac{d}{dm} F_w(m)  , and therefore in our case:  f_w(m) = wm^{w-1} . I’ll also denote the Indicator Function as I (which means I_(m_1 < m_2) is 1 when m_1 < m_2 and 0 otherwise). Finally, we’ll work only on the range [0,1]:

 \begin{align*} p(M_1 < M_2) & = \int_{m_2=0}^1 \int_{m_1=0}^1 \mathbb{I}_{m_1 < m_2} f_{w_1}(m_1) f_{w_2}(m_1) dm_1 dm_2 \\ & = \int_{m_2=0}^1 \int_{m_1=0}^{m_2} f_{w_2}(m_2) f_{w_1}(m_1) dm_1 dm_2 \\ & = \int_{m_2=0}^1 f_{w_2}(m_2) \biggr[ m_1^{w_1} \biggr]_{m_1=0}^{m_2} dm_2 \\ & = \int_{m_2=0}^1 w_2 m_2^{w_2-1} \cdot m_2^{w_1} dm_2 \\ & = \int_{m_2=0}^1 w_2m_2^{w_1 + w_2 - 1} dm_2 \\ & = \frac{w_2 \cdot m_2^{w_1+w_2}}{w_1+w_2} \biggr|_{m_2=0}^1 \\ & = \frac{w_2}{w_1 + w_2} \end{align*}

So we’ve proved that the distribution with CDF  F_w(m) = m^w   indeed imitates weighted sampling.

Part III: Linking to Uniform Distribution

As programmers, the Uniform Distribution is usually the most accessible one we have, regardless of language or libraries. It will only make sense to link the custom-made distribution we just found to the Uniform Distribution, which will then allow us to use the latter for weighted sampling.

Say some X is yielded from D_w (that is, X \sim D_w), what is the probability X is smaller than some number \alpha? This is given by the CDF:

 p(X < \alpha) = F_w(\alpha) = \alpha^w

Let’s examine another variable, Y, which we’ll define as Y = \sqrt[w]{R}, when R originates from the Uniform Distribution r \sim U(0,1). What is the probability that Y is smaller than \alpha? Let’s calculate, remembering that the CDF of U(0,1) for any 0 \leq u \leq 1 is F_U(u) = u:

 p(Y < \alpha) = p(\sqrt[w]{R} < \alpha) = p(R < \alpha^w) = F_U(\alpha^w) = \alpha^w

This is the same result we got for X which was sampled from D_w, and this means we can sample a number from U(0,1), take its wth root, and it would be just as if we used D_w all along. This means that the priority m of a number w is given by m=r^(1/w). Neat.

Really Making It Work

There’s a saying I like which states that the difference between theory and practice is that theory only works in theory. So we found a fast-enough algorithm, proved it mathematically, and of course it doesn’t work. Why? Because computers. Let’s take a look at our m values again: m=r^(1/w). As mentioned before, we use our models to predict CTR, and so w = CTR, which is always a number in the range of [0,1], and usually very small. As r is also sampled from the same range, r^{1/w} becomes very small, as r<1 and 1/w>1. It actually becomes so small and so often, that the computer doesn’t handle the precision very well, and we get zeros for all values. Let’s see an example using Python:

>>> from random import random
>>> m = lambda w: random() ** (1.0/w)
>>> m(0.002), m(0.001)
(0.0, 0.0)

Not good.

So what can we do? Yet again, math to the rescue! As we’ve previously said, we don’t really care about the absolute values of m, but only about the order between them. This means we can apply any monotonic function on the mapped values and control how fast the numbers decrease. For us, a simple logarithm did the trick: \log(r^{1/w}) = \frac{\log(r)}{w}. Let’s see this in action:

>>> from random import random
>>> from math import log
>>> m = lambda w: log(random())/w
>>> m(0.002), m(0.001)
(-402.5850708710584, -758.9101125141252)

Much better. Still, this doesn’t come without a price tag – the logarithm we apply decreases the accuracy of the algorithm. This means that in our example of w_1=1 and w_2=2, we won’t get   m_1 < m_2  with probability 2/3, but something close. For us though, this deviation is something we’re fine with.

So, to wrap this up, our random-weighted sampling algorithm for our real-time production services is:

1) map each number in the list: w \rightarrow \log(r)/w
..(r is a random number, chosen uniformly and independently for each number)
2) reorder the numbers according to the mapped values
..(first number will be the one which has the largest mapped-value, and so on)


Final Words

Summing this process up, we’ve started with a naive algorithm which wasn’t efficient enough, moved on to the exact opposite – an efficient algorithm which doesn’t work, and then modified it to an almost-exact version which works great and is also efficient.

If I need to conclude, I can only say this – there’s something super exciting about stepping down from our daily routine of developing state-of-the-art AI models and return to our roots as algorithm developers; going back to the basics, develop mathematical proofs, sleeping by the river under starry skies and cooking dinner by the fire – we don’t get to this every day, and I think we’re all glad we did it this time. So wherever you may surf online, know that we just made your experience a little better using plain ol’ math.