## Implementing inverse transform sampling, rejection sampling and importance sampling in Python

In this post we’ll discuss how to sample from a probability distribution. This is a common need, in an ML setting this is most frequently used to run inference for probabilistic models. However, due to the distributions being very complex, this is often intractable. Thus, main focus of this post is introducing approximate methods for this task, in particular using numerical sampling, known as Monte Carlo methods.

Still, for introductory purposes we’ll introduce inverse transform sampling first, which allows exact inference for arbitrary, tractable distributions. Then, we’ll shift our focus to approximate methods, allowing sampling or moment estimation for (near) arbitrary distributions: we start with rejection sampling and then move to importance sampling.

Note this post is the first in a series familiarising readers with [1], and this post in particular covers parts of Chapter 11: Sampling Methods.

The inverse transform sampling method allows sampling from any distribution for which we know how to calculate the inverse of the cumulative distribution function (CDF). It entails sampling `y`

from `U[0, 1]`

(the uniform distribution), and then calculating the desired `x`

as:

I.e., we are generating a random variable

and then claim that

— that the inverse of the cumulative distribution (denoted by `F`

) follows our desired target distribution (more specifically, its probability density function, denoted by `f`

).

This is the case if and only if the CDFs are the same, i.e.:

To prove this, let’s examine the left side, and apply F on both sides of the equation:

Since for the uniform distribution over [0, 1] we have

we obtain, as desired:

(proof taken from COMP 480 / 580).

## Python Implementation

Let’s try this for the exponential distribution, defined by the pdf:

Taking the CDF and inverting it yields:

Thus giving us our desired sampling formula! Let’s convert this to Python.

We first define a function `exp_distr`

evaluating the pdf at the given `x`

values, and then a function `exp_distr_sampled`

sampling from the exponential distribution with the inverse transform method and our above derived formula.

Finally we plot the true pdf and a histogram of our sampled values:

`import numpy as np`

import matplotlib.pyplot as pltNUM_SAMPLES = 10000

RATE = 1.5

def exp_distr(x: np.ndarray, rate: float) -> np.ndarray:

return rate * np.exp(-x * rate)

def exp_distr_sampled(num_samples: int, rate: float) -> np.ndarray:

x = np.random.uniform(0, 1, num_samples)

return -1 / rate * np.log(1 - x)

x = np.linspace(0, 5, NUM_SAMPLES)

y_true = exp_distr(x, RATE)

y_sampled = exp_distr_sampled(NUM_SAMPLES, RATE)

plt.plot(x, y_true, "r-")

plt.hist(y_sampled, bins=30, density=True)

plt.show()

Running this code produces something like the below, showing that indeed our sampling produces correct results:

If possible, inverse transform sampling works perfectly. However, as mentioned, it requires us being able to invert the CDF. This is only possible for certain, simpler distributions — even for a normal distribution this is significantly more complex, albeit possible. And when this is too hard, or not possible — we have to resort to other methods, which we will cover in this section.

We start with rejection sampling and then introduce importance sampling, and conclude this article with a discussion of limitations of these methods and an outlook.

Note that both methods still require that we can evaluate `p(x)`

for a given `x`

.

## Rejection Sampling

Rejection sampling involves introducing a simpler, proposal distribution `q`

from which we can sample — and which we use to approximate `p`

. It follows the idea that `q(x) ≥ p(x)`

for all `x`

and we sample from `q`

, but reject all samples which are above `p`

— thus resulting exactly in the distribution `p`

eventually.

As mentioned above, it is possible to apply inverse transform sampling to normal distributions, but hard — thus for the sake of demonstration let us here try to approximate a normal distribution (`p`

), and use a uniform distribution as proposal distribution (`q`

). Let’s visualise above intuition in this setting:

Now with rejection sampling, we would sample from the uniform-like distribution `q`

, and reject all samples lying above `p`

— thus in the limit resulting in points filling out exactly the area under `p`

, i.e. sampling from this distribution.

Formally, we first select our proposal distribution `q`

, and then a scaling coefficient `k`

s.t. `kq(x) ≥ p(x)`

for all `x`

. Then, we sample a value `x_0`

from `q`

. Next, we generate a value `u_0`

from the uniform distribution over `[0, kq(x_0)]`

. If now `u_0 > p(x_0)`

, we reject the sample, otherwise we accept it.

Let’s implement this in Python:

`from typing import Tuple`import matplotlib.pyplot as plt

import numpy as np

NUM_SAMPLES = 10000

MEAN = 0

STANDARD_DEVIATION = 1

MIN_X = -4

MAX_X = 4

def uniform_distr(low: float, high: float) -> float:

return 1 / (high - low)

def normal_dist(

x: np.ndarray, mean: float, standard_deviation: float

) -> np.ndarray:

return (

1

/ (standard_deviation * np.sqrt(2 * np.pi))

* np.exp(-0.5 * ((x - mean) / standard_deviation) ** 2)

)

x = np.linspace(MIN_X, MAX_X, NUM_SAMPLES)

y_true = normal_dist(x, MEAN, STANDARD_DEVIATION)

def rejection_sampling(

num_samples: int, min_x: float, max_x: float

) -> Tuple[np.ndarray, float]:

x = np.random.uniform(min_x, max_x, num_samples)

# uniform_distr(-4, 4) = 0.125 -> we need to scale this to ~0.5, i.e.

# select k = 4.

k = 4

u = np.random.uniform(np.zeros_like(x), uniform_distr(min_x, max_x) * k)

(idx,) = np.where(u < normal_dist(x, MEAN, STANDARD_DEVIATION))

return x[idx], len(idx) / num_samples

y_sampled, acceptance_prob = rejection_sampling(NUM_SAMPLES * 10, MIN_X, MAX_X)

print(f"Acceptance probability: {acceptance_prob}")

plt.plot(x, y_true, "r-")

plt.hist(y_sampled, bins=30, density=True)

plt.show()

Yielding the following result:

`Acceptance probability: 0.24774`

## Importance Sampling

Often, we are only interested in expectations, which is where importance sampling comes into play: it is a technique to approximate expectations (or other moments) from a complex distribution `p`

using again a proposal distribution `q`

.

The expectation of a (continuous) random variable X is defined as:

We then use a little mathematical trick to reformulate this into:

Now what does this formula say? It is the expectation of

under `q`

— i.e. to calculate `E[X]`

we can now evaluate this term when sampling from `q`

! The coefficients `p(x) / q(x)`

are called importance weights.

For the sake of demonstration, let’s again set `p`

to be a normal distribution — and use another normal distribution for `q`

:

`import numpy as np`NUM_SAMPLES = 10000

MEAN_P = 3

MEAN_Q = 2

def normal_dist(

x: np.ndarray, mean: float, standard_deviation: float

) -> np.ndarray:

return (

1

/ (standard_deviation * np.sqrt(2 * np.pi))

* np.exp(-0.5 * ((x - mean) / standard_deviation) ** 2)

)

q_sampled = np.random.normal(loc=MEAN_Q, size=NUM_SAMPLES)

p_sampled = (

q_sampled

* normal_dist(q_sampled, MEAN_P, 1)

/ normal_dist(q_sampled, MEAN_Q, 1)

)

print(

f"Resulting expecation when sampling from q {np.mean(p_sampled)} vs true expecation {MEAN_P}"

)

In the code, we set `p`

to be a normal distribution with mean 3. Then, we sample `NUM_SAMPLES`

from the proposal distribution `q`

, which is a normal distribution with mean 2 — and use above introduced reformulation to calculate the expectation of `X`

under `p`

via this — yielding approximately the right result (~3 vs 3).

Let’s finish this section with a discussion about variance: the variance of the resulting samples will be

For our case, this means an increase in variance the more `p`

and `q`

vary among each other, i.e. disagree. To demonstrate, compare the variance for `MEAN_Q = 3.2`

and `5`

, we get ~0.20 and ~91.71, respectively.

Note however, that importance sampling is also commonly used as a variance reduction technique by cleverly choosing `q`

— however this might be a topic for a future post.

In this post we introduced three sampling methods: inverse transform sampling, rejection sampling, and importance sampling.

Inverse transform sampling can be used for relatively simple distributions, for which we know how to invert the CDF.

For more complex distributions, we have to resort to rejection or importance sampling. Still, for both we need to be able to evaluate the pdf of the distribution in question. Furthermore, there are other drawbacks, such as: rejection sampling is wasteful when we cannot “box” `p`

properly with `kq `

— this gets especially tricky in higher dimensions. Similarly for importance sampling, it is — especially in higher dimensions — hard to find good proposal distributions `q`

with suited importance weights.

If any of the above mentioned drawbacks are too severe, we have to resort to more powerful methods — e.g. from the family of Markov chain Monte Carlo methods (MCMC). These have way less strict requirements on the distributions we want to approximate, and suffer from less limitations, e.g. in high-dimensional spaces.

This finishes this introduction to sampling methods. Please note all images unless otherwise noted are by the author. I hope you enjoyed this post, thanks for reading!

**References:**

[1] Bishop, Christopher M., “Pattern Recognition and Machine Learning”, 2006, https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf