Monthly Archives: June 2012

Discrete probability integral transform for large N

You may or may not have heard of the probability integral transform, but it’s got to be up there as one of my favourite integral transforms (yeah – I have favourites). The basic idea is that you want a sample of random variables from some non-standard probability distribution, but all you have is a basic random number generator that spits out samples from the uniform distribution. With a bit of magic involving the cumulative distribution function (cdf) of your target distribution, and an inversion, you can turn that uniform sample into a sample from the target distribution. This is incredibly useful. (Check out Martin Hendry’s excellent SUPA data analysis course, section 6, for a nice introduction to the probability integral transform.)
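As a concrete illustration (not from the course notes – the exponential distribution is just a convenient example, because its inverse cdf is analytic), the transform in action looks something like this:

```python
import numpy as np

np.random.seed(42)
u = np.random.uniform(0.0, 1.0, 100000)  # basic uniform random number generator
lam = 2.0
# Analytic inverse cdf of the Exponential(lam) distribution: F^-1(u) = -ln(1 - u) / lam
samples = -np.log(1.0 - u) / lam
# 'samples' is now distributed as Exponential(lam), with mean 1/lam
```

For a non-standard target distribution there's usually no analytic inverse like this, which is where the discrete version below comes in.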

This requires knowledge of the cdf of your target distribution (or its pdf, from which the cdf follows by integration), which may be non-trivial to evaluate (hint, hint: in large-scale structure calculations). In particular, there are lots of situations where you will only have a discrete representation of the pdf, sampled over some set of intervals. That’s not such a problem, though, because you can approximate the integral that defines the cdf as a cumulative sum.
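For instance, a minimal sketch of building the discrete cdf from pdf samples with a cumulative sum (the exponential pdf and the grid here are just placeholder choices):

```python
import numpy as np

x = np.linspace(0.0, 5.0, 512)   # grid on which the pdf is sampled
pdf = np.exp(-x)                 # example: un-normalised pdf samples
cdf = np.cumsum(pdf)             # cumulative sum approximates the cdf integral
cdf /= cdf[-1]                   # normalise so the cdf runs over [0, 1]
```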

One thing that is a problem, however, is the inversion part of the process. Because of the non-analytic nature of the problem, you’ll probably have to do some sort of nearest-neighbour search to do the inversion. (Essentially, given some value, you want to find the index of the entry of an array y that lies closest to that value, and then retrieve the entry at that same index from a corresponding array x, where y = f(x).) If you want to draw a large number of values from the pdf, or if you’ve produced a lot of samples of the pdf, this can be computationally intensive.

argmin: The simple way

One solution to the problem is to use NumPy’s argmin() function, like so:

import numpy as np

idxs = [(np.abs(cdf - x)).argmin() for x in X]

Here, cdf is the (discretely-sampled, normalised to interval [0, 1]) cumulative distribution function, and X is an array of draws from the Uniform distribution. The output is an array of indices, which can then be used to find the y corresponding to each x in X, like so:

y_samp = Y[idxs]

Unfortunately, this method doesn’t scale very well: each draw triggers a full scan of the cdf array, so the cost grows as the length of X times the number of samples in the cdf. It’s fine for small samples, though.
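Putting the pieces together, a self-contained sketch of the argmin() approach might look like this (the exponential target and the grid sizes are arbitrary example choices):

```python
import numpy as np

np.random.seed(1)
Y = np.linspace(0.0, 5.0, 256)       # grid on which the pdf is sampled
pdf = np.exp(-Y)                     # example target pdf (un-normalised)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]                       # normalised, discretely-sampled cdf

X = np.random.uniform(size=2000)     # draws from the uniform distribution
# For each uniform draw, find the index of the nearest cdf sample...
idxs = [(np.abs(cdf - x)).argmin() for x in X]
# ...and read off the corresponding grid value
y_samp = Y[idxs]                     # approximately exponentially distributed
```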

KDTree: The sophisticated way

If you do have large sample sizes, argmin() won’t cut it. Fortunately, SciPy has a nice C implementation of KD-tree nearest-neighbour search, which can be found in its spatial module. (There’s also a pure Python implementation, but that’s a lot slower for large samples.) All you need to do is build a tree, and then query it with your Uniform sample, X. For large samples, the speed-up when compared to the argmin() method, above, is breathtaking. Here’s some example code:

import scipy.spatial

tree = scipy.spatial.cKDTree( cdf.reshape((cdf.size, 1)) )
dists, idxs = tree.query( X.reshape((X.size, 1)) )

These functions require a bit of reshaping of the (1D) input arrays to get things right (the KDTree algorithm is designed to efficiently deal with k-dimensional data – hence the “K-D” part of its name – and so expects multi-dimensional arrays), but that can be done inexpensively. The query() method returns the distances between the input values and the nearest neighbours too, at no extra cost in computation.

What you’re left with is a super-fast nearest-neighbour search, that you could even extend to more dimensions (if you had a multivariate pdf, for example). This makes the probability integral transform much cheaper computationally.
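As a rough end-to-end sketch (again with a made-up exponential target, just for illustration), the tree-based inversion might look like this:

```python
import numpy as np
import scipy.spatial

np.random.seed(2)
Y = np.linspace(0.0, 5.0, 4096)      # grid on which the pdf is sampled
pdf = np.exp(-Y)                     # example target pdf (un-normalised)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

X = np.random.uniform(size=100000)   # a large uniform sample

# Build the tree once, then query it with the whole uniform sample
tree = scipy.spatial.cKDTree(cdf.reshape((cdf.size, 1)))
dists, idxs = tree.query(X.reshape((X.size, 1)))
y_tree = Y[idxs]

# Spot-check against the argmin() method on a small subset
idxs_argmin = np.array([(np.abs(cdf - x)).argmin() for x in X[:100]])
```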

Code Coffee: Parallelisation with OpenMP

Rich Booth gave us an introduction to using OpenMP to parallelise our code. It turns out to be surprisingly easy – all you need to do is add a specially-formed comment here and there, and OpenMP will do the rest. At its most basic, OpenMP just takes a serial code (code meant to be run on a single processor), and splits the work between multiple threads, which can be run on multiple processors. This (hopefully) speeds up code execution by sharing the load. Rich gave examples for OpenMP in C and Fortran, but there are other parallelisation tools out there for other languages, like the multiprocessing module in Python. It was also mentioned that MATLAB has excellent multi-processing capabilities, which tend to be quite easy to use.

Rich worked off this introduction to OpenMP, and showed us some basic examples of how for loops can be parallelised. He also discussed concepts like:

  • Scheduling: how the workload should be split up between different threads to make things as fast as possible (different scheduling strategies are available)
  • Scope: which variables should be private to a thread, which should be public and shared between all threads, and how to specify which is which
  • Preventing race conditions: ensuring that shared variables are updated in a coherent way by individual threads, e.g. by using atomic operations
  • Functions: Bundling code into a function which is then called from inside the parallelised block of code, to make it easier to keep track of private and shared variables
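Rich’s examples were in C and Fortran, which I won’t reproduce here; but as a very rough Python analogue of parallelising a for loop – using the multiprocessing module mentioned above, not OpenMP itself – a sketch might look like this:

```python
import multiprocessing

def work(i):
    # Stand-in for a computationally-intensive, independent loop iteration
    return i * i

def main():
    # Rough analogue of '#pragma omp parallel for': the pool splits the
    # loop iterations across worker processes and gathers the results
    pool = multiprocessing.Pool(processes=4)
    try:
        results = pool.map(work, range(100))
    finally:
        pool.close()
        pool.join()
    return results

if __name__ == "__main__":
    print(sum(main()))
```

As with an OpenMP parallel for loop, each iteration has to be independent of the others for this to be safe.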

Other topics that were touched on include: the difficulty of debugging parallelised code (the best way is to avoid having to debug, by following a few good practices); the fact that OpenMP-enabled code can be run in “serial mode” by compiling with a standard compiler instead (the compiler will just throw up a few spurious warnings when it gets to the OpenMP bits); and that it’s best to only use OpenMP to parallelise a few well-defined tasks, like computationally-intensive for loops, rather than trying to parallelise everything (which can take a long time to code properly).

All in all, OpenMP looks like a nice, relatively stable way of speeding up operations that can be parallelised. It’s a lot simpler to implement than I thought, and doesn’t seem to require a load of code rewrites or anything as serious as that. There’s quite a bit of introductory tutorial material on the web, along with a few blogs dedicated to parallel programming. As usual, Wikipedia is helpful on concepts.

Code Coffee

Joe Zuntz and I recently started running an informal discussion group on computational techniques in astrophysics here in Oxford (“Code Coffee”). Pretty much all of the astrophysicists I know spend the majority of their time fighting with computers, so we thought it would be a nice way to get people together to collectively solve problems and share experience. It’s also a good way of learning about interesting techniques – astros tend not to have too much in the way of a formal computing education, so talking about even basic techniques can be valuable.

The format of the sessions is as follows: We start off with a brief, informal presentation on a computing topic, by a member of the department who has some interest or expertise in it. This is supposed to last five or ten minutes, but almost invariably runs over as people ask questions and try to figure out how best to apply what they’re learning to their own problems. I think this is proving to be quite handy to many of the attendees – there are lots of easy ways of improving your code, if only you know about them. We then move on to discussion or “show and tell”, where we use the collective experience in the room to help solve problems, discuss good practices, and so on. We also shovel biscuits down everyone’s throats.

So, that’s the idea behind it. After each Code Coffee, I’m going to post a brief summary of the discussion on this blog, along with web links to useful resources, mostly for the benefit of people who attended the sessions. Feel free to contribute in the comments, too.