You may or may not have heard of the probability integral transform, but it’s got to be up there as one of my favourite integral transforms (yeah – I have favourites). The basic idea is that you want a sample of random variables from some non-standard probability distribution, but all you have is a basic random number generator that spits out samples from the uniform distribution. With a bit of magic involving the cumulative distribution function (cdf) of your target distribution, and an inversion, you can turn that uniform sample into a sample in the target distribution. This is incredibly useful. (Check out Martin Hendry’s excellent SUPA data analysis course, section 6, for a nice introduction to the probability integral transform.)

This requires knowledge of the cdf of your target distribution – that is, the integral of its pdf – which may be non-trivial to evaluate (hint, hint: in large-scale structure calculations). In particular, there are lots of situations where you will only have a discrete representation of the pdf, sampled over some set of intervals. That’s not such a problem, though, because you can just approximate the integral transform as a sum.
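For instance, if you have the pdf tabulated on a uniform grid, a normalised cumulative sum gives you the discrete cdf. A minimal sketch (the Gaussian pdf and grid here are just for illustration):

```python
import numpy as np

# Illustrative discretely-sampled pdf: an unnormalised Gaussian on a grid
x = np.linspace(-5.0, 5.0, 1000)
pdf = np.exp(-0.5 * x**2)

# Approximate the integral transform as a cumulative sum. On a uniform
# grid the bin width cancels when we normalise the cdf to end at 1.
cdf = np.cumsum(pdf)
cdf /= cdf[-1]
```

The result is a monotonically increasing array on [0, 1], which is all the inversion step below needs.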

One thing that is a problem, however, is the inversion part of the process. Because of the non-analytic nature of the problem, you’ll probably have to do some sort of nearest-neighbour search to do the inversion. (Essentially, given some value, you want to find the array index in *y* that lies closest to that value, and then retrieve that same index from a corresponding array *x*, where *y = f(x)*.) If you want to draw a large number of values from the pdf, or if you’ve produced a lot of samples of the pdf, this can be computationally intensive.

## argmin: The simple way

One solution to the problem is to use NumPy’s argmin() function, like so:

idxs = [(np.abs(cdf - x)).argmin() for x in X]

Here, cdf is the (discretely-sampled, normalised to interval [0, 1]) cumulative distribution function, and *X* is an array of draws from the Uniform distribution. The output is an array of indices, which can then be used to find the *y* corresponding to each *x* in *X*, like so:

y_samp = Y[idxs]

Unfortunately, this method doesn’t scale very well with the length of *X* or the number of samples in the cdf. It’s fine for small samples, though.

## KDTree: The sophisticated way

If you do have large sample sizes, argmin() won’t cut it. Fortunately, SciPy has a nice C implementation of KD-tree nearest-neighbour search, which can be found in its spatial module. (There’s also a pure Python implementation, but that’s a lot slower for large samples.) All you need to do is build a tree, and then query it with your Uniform sample, *X*. For large samples, the speed-up when compared to the argmin() method, above, is breathtaking. Here’s some example code:

tree = scipy.spatial.cKDTree( cdf.reshape((cdf.size, 1)) )
dists, idxs = tree.query( X.reshape(X.size, 1) )

These functions require a bit of reshaping of the (1D) input arrays to get things right (the KDTree algorithm is designed to efficiently deal with *k*-dimensional data – hence the “K-D” part of its name – and so expects multi-dimensional arrays), but that can be done inexpensively. The query() method returns the distances between the input values and the nearest neighbours too, at no extra cost in computation.

What you’re left with is a super-fast nearest-neighbour search that you could even extend to more dimensions (if you had a multivariate pdf, for example). This makes the probability integral transform much cheaper computationally.
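Putting the pieces together, here’s a minimal end-to-end sketch (the function name and the Gaussian test pdf are my own, purely illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def draw_from_discrete_pdf(x_grid, pdf_vals, n_draws, rng=None):
    """Draw n_draws samples from a pdf tabulated on x_grid, via the
    probability integral transform with a KD-tree inversion."""
    rng = np.random.default_rng() if rng is None else rng
    # Discrete cdf: normalised cumulative sum of the pdf samples
    cdf = np.cumsum(pdf_vals)
    cdf /= cdf[-1]
    # Build the tree on the cdf values (reshaped to 2D, as cKDTree expects)
    tree = cKDTree(cdf.reshape(-1, 1))
    # Invert: find the cdf sample nearest each uniform draw
    u = rng.uniform(size=n_draws)
    _, idxs = tree.query(u.reshape(-1, 1))
    return x_grid[idxs]

# Example: draw from a discretely-sampled standard Gaussian
x_grid = np.linspace(-5.0, 5.0, 2000)
pdf_vals = np.exp(-0.5 * x_grid**2)
samples = draw_from_discrete_pdf(x_grid, pdf_vals, 50000,
                                 rng=np.random.default_rng(42))
```

With enough draws, the sample mean and standard deviation should land close to the 0 and 1 of the target Gaussian.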
