Tag Archives: python

Reducing the padding in matplotlib figures

By default, matplotlib figures have a lot of padding around them. This is quite annoying when producing plots for publication, as you end up with a load of useless whitespace that you have to compensate for by screwing around with the figure sizing and position (especially in LaTeX articles).

Fortunately, matplotlib has a neat convenience function that removes all of this unnecessary whitespace in one fell swoop. It’s as easy as this:

plt.tight_layout()

See the figures below for a comparison. This function also takes a ‘pad’ argument that lets you fine-tune the padding manually.

Oh, how I wish I’d known about this three years ago.

[Figure: Matplotlib figure with normal padding]

[Figure: Matplotlib figure with minimal padding]


Making Numpy arrays read-only

The CMB analysis framework that we’re writing at the moment needs to cache certain data (e.g. sky maps, maps of noise properties etc.) for quick access during various stages of the analysis. It’s important that other parts of the program don’t alter the cached information directly, since it could fall out of sync with the current state of the MCMC chain. To protect the cache, then, we need to make certain arrays read-only.

An easy way of doing this for Numpy arrays is to use the setflags() method on the array. The flag for modifying read/write access is, unsurprisingly, write, so to make an array called arr read-only, you would simply call arr.setflags(write=False). If you subsequently try to modify the array, it raises a ValueError, like so:

>>> import numpy as np
>>> a = np.arange(6)
>>> a.setflags(write=False)
>>> a[4] = 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only
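As a sketch of how this could protect cached data (the freeze() helper and the cache layout are hypothetical, not lifted from our actual framework):

```python
import numpy as np

def freeze(arr):
    # Hypothetical helper: return a read-only view of arr. Note that the
    # underlying data is shared, so the original array can still be modified.
    view = arr.view()
    view.setflags(write=False)
    return view

# A toy cache of arrays that other code should not be able to modify
cache = {"skymap": freeze(np.zeros(12))}

try:
    cache["skymap"][0] = 1.0
except ValueError as err:
    print("Write blocked:", err)
```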


Customising contour plots in matplotlib

So, you need to include a contour plot in some publication of yours, little one? There are two things that you must learn. But beware! The first will raise your spirits, while the second will quicken your descent into madness. These facts I address to you, should you stand to read them: (1) matplotlib has a really customisable contour plot implementation; (2) the convenience functions are few, and the necessary keyword arguments are confusing.

I’m prepping a bunch of contour plots for a publication in a quick letter at the moment, and want to improve their legibility. This involves things like making all of the lines thicker, increasing the font size of the labels (both on the axes and on the contours themselves), and changing the contour scaling. Some of these modifications have proven more difficult than others with matplotlib, so I thought I’d jot down a couple of examples here for reference. (The matplotlib contour examples are useful, but don’t cover everything.)

Tick size

First of all, let’s change the width and length of the ticks on each axis, and the size of the font of the labels for each tick.

import pylab as P

MP_LINEWIDTH = 2.4
MP_TICKSIZE = 10.

P.rc('axes', linewidth=MP_LINEWIDTH)

P.subplot(111)
for tick in P.gca().xaxis.get_major_ticks():
  tick.label1.set_fontsize(20.)
  tick.tick1line.set_markeredgewidth(MP_LINEWIDTH)
  tick.tick2line.set_markeredgewidth(MP_LINEWIDTH)
  tick.tick1line.set_markersize(0.5*MP_TICKSIZE)
  tick.tick2line.set_markersize(0.5*MP_TICKSIZE)

The call to P.rc() changes the thickness of all of the axis lines on the plot (i.e. it changes the global properties, and isn’t restricted to just one plot). The other stuff could probably be done using calls to rc(), but that’s an exercise for another time.

[Figure: Plot of Compton y-distortion, an example of a customised contour plot in matplotlib.]

The loop is over all major ticks on the x axis of the current subplot. (The call to gca() gets the current set of plot axes, but of course you could use the xaxis.get_major_ticks() method from any previously-defined axes object.) The object tick1line is the tick mark on the x-axis at the bottom of the plot, and tick2line is the corresponding mark at the top.

Inside the loop, I’m setting the font size of the tick label (the number that appears below each tick), the width of the tick (using the markeredgewidth property), and the length of the tick (using markersize). You can also do things like hiding certain ticks, hiding certain labels (especially useful if you want to remove the tick label at the origin, because it overlaps with the label for the other axis there), changing the appearance of gridlines (somewhat unintuitively), and so on. There’s a list of all the tick-related objects that you can change here.

This will only change the tick styles for the major ticks on the x axis. To cover all of the ticks on the plot, you’ll need to loop through the minor ticks using P.gca().xaxis.get_minor_ticks() too, and then do both major and minor ticks for the y axis (using P.gca().yaxis) as well.
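Putting that together, here is a sketch that applies the same styling to every tick on both axes (same constants as above; the Agg backend line is just so this runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pylab as P

MP_LINEWIDTH = 2.4
MP_TICKSIZE = 10.

P.subplot(111)

# Loop over both axes, and over major and minor ticks on each
for axis in (P.gca().xaxis, P.gca().yaxis):
    for tick in axis.get_major_ticks() + axis.get_minor_ticks():
        tick.label1.set_fontsize(20.)
        tick.tick1line.set_markeredgewidth(MP_LINEWIDTH)
        tick.tick2line.set_markeredgewidth(MP_LINEWIDTH)
        tick.tick1line.set_markersize(0.5*MP_TICKSIZE)
        tick.tick2line.set_markersize(0.5*MP_TICKSIZE)
```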

Changing which values contours are drawn at, and how they are labeled

To change where the contours are drawn, you need to change the contour locator. This is a keyword argument to contour(), and requires a ticker object. There’s a list of built-in tickers here. In the example below, I wanted a log scaling for my plots, so I used LogLocator().

It’s also useful to be able to change the labels that are added to the contours. For this, you need to use the clabel() function (which controls contour labels) and manipulate the formatter, using the fmt keyword. A list of formatters is given here, with more documentation on them (including formatter-specific keyword arguments for further customisation) given further down that page. Particularly useful formatters include:

  • LogFormatterMathtext (which adds LaTeX mathmode labels, especially useful for numbers with exponents)
  • FormatStrFormatter (which lets you use a C-like format string to determine how significant figures, signs, etc. are displayed)
  • FuncFormatter (which lets you define a custom function to handle string formatting)

Setting the font size of the contour label is as easy as changing the fontsize keyword of clabel().

Finally, a subtlety of contour() is its use of the linewidths keyword, rather than linewidth (as with other pylab functions). It works exactly the same otherwise.

from matplotlib import ticker
ctr = P.contour(X, Y, Z, locator=ticker.LogLocator(), colors='k', linewidths=3.0)
P.clabel(ctr, inline=1, fontsize=20., fmt=ticker.LogFormatterMathtext())


Code Coffee: One-line guides to getting started in astro computing

I was away on my travels last week, so I missed the first Code Coffee of the new term here in Oxford. Joe Zuntz was present to officiate, though, and led a discussion (allegedly) for the benefit of our new doctoral students, who’d just arrived at that point. The result was a collection of “one-line guides” on getting started with astrophysics-style computing and programming, which I’ve copied below. (Be warned: It’s a little Oxford-specific in parts.)

Getting help

Answering Questions

Training

Don’t code maths!

Numerical Methods

General coding advice

Files

  • Don’t invent your own file formats.
  • FITS tables are not a great format. Consider using HDF instead.

Compilers

  • GNU: gcc, g++, gfortran (freely-available, cross-platform)
  • Intel: icc, ifort (tend to be faster for some applications)
  • Avoid mixing Intel/GNU compilers, and different versions of the compilers, when using Fortran. The compiled Fortran modules that are produced may not be compatible between different versions.

Best Practices


Handling Fortran binary files with Python

Today’s top tip, straight from the front lines of, err, cosmology, concerns the handling of Fortran binary files with Python. If you have to deal with legacy Fortran code, you might find that it has a penchant for outputting files in an “unformatted” binary format. Exactly how the data is stored in this format can apparently depend on a number of factors, such as the endianness of your machine. Hurrah! The files are non-portable. The need to figure out what exactly is being dumped into the file can also cause headaches if you’re trying to read Fortran output into a program written in another language (for example, C/C++).

And it gets worse – what if you’re trying to write a file to be read in by the Fortran code, using a different language? Well, this particular problem cropped up for me, and of course I wanted to use Python to solve it.

For Python, fortunately, Neil Martinsen-Burrell has written FortranFile, which is subclassed off the standard Python file object, and which provides some convenient functions for reading and writing Fortran-formatted files. (There’s also an entry in the SciPy Cookbook that is relevant for Fortran I/O.) The pertinent function for me was writeReals(), which writes a NumPy array to a file in the appropriate format. The only subtlety I came across while briefly fiddling with this was that, when constructing a new FortranFile object for writing, you have to pass the mode keyword (e.g. mode="w"), which is inherited from file. (There’s a thread with a couple of tips on using FortranFile here.)

There are probably better ways of doing this than downloading someone else’s code off the internet, but like I said, I didn’t want to spend all afternoon on it.
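For reference, SciPy itself now ships a scipy.io.FortranFile class that does much the same job. A quick round-trip sketch (the file name is made up, and I’m assuming the default 4-byte record markers and native endianness):

```python
import numpy as np
from scipy.io import FortranFile

data = np.linspace(0.0, 1.0, 5)

# Write a single unformatted Fortran record
f = FortranFile("data.unf", mode="w")
f.write_record(data)
f.close()

# Read it back as double-precision reals
f = FortranFile("data.unf", mode="r")
readback = f.read_reals(dtype=np.float64)
f.close()

print(np.allclose(readback, data))
```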


Drawing random numbers from a multivariate Gaussian distribution

I like software libraries. A software library means that someone has already done some of the heavy lifting for me. A software library means that I won’t have to reinvent the wheel. It means that I might not have to spend my afternoon fixing bugs and stringently validating my own code*.

It also means that I don’t necessarily have a good idea of what the computer is doing “behind the scenes”; that I’m potentially unaware of the clever (or not) algorithm that is being implemented, and what its limitations (or bugs) are.

For a long time now, I’ve been blithely accepting that numpy.random.multivariate_normal applies some sensible, but mysterious, procedure in order to generate random variables from a multivariate Gaussian (normal) distribution. And it does – I just never tried to unravel said mystery. So, prompted by Dag Sverre, and courtesy of Wikipedia, I present a Python snippet that draws multivariate Gaussians given (a) some way of generating a set of standard univariate Gaussian random numbers, and (b) a Cholesky decomposition routine. In all, it’s about 10% faster than the NumPy routine, which applies a bunch of validity tests that I skip.

import numpy as np

# Define mean, covariance matrix, and no. of samples required
mean = ... # some array of length N
cov = ... # some positive-definite NxN matrix
Ndraws = 1000

# Do factorisation (can store this and use it again later)
L = np.linalg.cholesky(cov)

# Get N*Ndraws Gaussian random variables (mean=0, variance=1)
N = mean.size
norm = np.random.normal(size=(N, Ndraws))

# Construct final set of random numbers (with correct mean)
rand = mean[:, np.newaxis] + np.dot(L, norm)
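As a quick sanity check of the Cholesky trick (the mean and covariance here are my own toy numbers, not from any real analysis), the sample covariance of the draws should converge to the input covariance:

```python
import numpy as np

np.random.seed(42)  # fixed seed, so the check is reproducible

mean = np.array([1.0, -2.0, 0.5])
cov = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.1],
                [0.0, 0.1, 0.5]])

L = np.linalg.cholesky(cov)
Ndraws = 200000
norm = np.random.normal(size=(3, Ndraws))
samples = mean[:, np.newaxis] + np.dot(L, norm)

# For large Ndraws, the sample covariance approaches the input covariance
print(np.allclose(np.cov(samples), cov, atol=0.05))
```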

Of course, this now introduces two new mysteries: How do you efficiently implement Cholesky decomposition, and how do you generate univariate Gaussian random numbers? Those are exercises for another day, I think.

* It depends on how sloppy I’m feeling; of course, you should validate everything that matters, regardless of who wrote it.


Getting codes in different languages to interact

In a Code Coffee meeting last month (9th July), we discussed ways of getting codes in different languages to talk to one another. This is quite a common task; for example, if you need your modern data analysis pipeline to interact with some legacy code (normally Fortran!), or if you have a mostly Python code that has a computationally-intensive bit that would be faster written in C. Our focus was on Python, C, and Fortran, since they seem to be the most commonly used languages in the group.

People brought a number of different solutions with them. Some were very easy to implement, just as a simple few-line block in your Python code that could be used to speed up a loop. Others were quite a bit more complicated, requiring extra work on building modules and getting your hands dirty with memory management, but having the virtue of being much more flexible. The methods we discussed are summarised below, in increasing order of complexity. If you skip to the bottom, there’s also a link to a tarball, with some examples.


Weave

Weave is a way of writing C code “inline” – directly into a Python script.  Python variables are automatically made available to the C code (you sometimes need to do a bit of extra work to cast them to a specific type, but not much), which makes it nice and easy to use, with little boilerplate required in the C code. Essentially, you can just stick a fast, few-line C loop in the middle of a Python function, then call Weave and it’ll handle compilation and shuffling data between Python and C for you automagically, without requiring any complicated wrapper code or makefile magic. There’s also a “blitz” mode, which takes a Python/NumPy expression and converts it into C++ in the background, which can also help speed things up (although it’s generally slower than the inline mode).

Unfortunately, the documentation isn’t great, but there are plenty of examples out there (SageMath have a couple of very clear ones). It should definitely be your first port of call if all you want to do is speed up part of your Python code. It’s not so good if you want to interface with legacy code, or have something more complicated that you need to do in C (e.g. something that’s split into different functions). See also Cython, which seems to be a bit more flexible, but not too much more difficult to use. There’s a nice speed comparison with NumPy, Weave blitz and inline, and MATLAB here, and a Cython/Weave example (with benchmarks) here. There’s also an interesting blog post on getting GSL to work with Weave/Cython.

f2py

f2py is a way of automatically generating Python modules from Fortran code. It seems pretty easy to use – all you need to do is run it over your Fortran source code, and it will produce a compiled module that you can simply import into your Python code and use as you would any other. The f2py program scans the Fortran code and figures out how to deal with function arguments and convert between types itself, which is very handy. It even generates function documentation (docstrings) for the Python module, telling you what format the function arguments need to be in. One problem with f2py, pointed out by Ryan, is that it tends to fail silently, making it difficult to debug! It’s apparently quite robust though, so this probably won’t be necessary that often. The documentation looks pretty good, and you can find other resources here. (N.B. They ask you to cite their paper if you use f2py in research code.)

Python ctypes

ctypes is a Python module for handling calls to shared libraries (whether they are written in C/C++ or Fortran). The syntax for loading libraries and calling functions is exceptionally simple – little different from importing a native Python module and calling functions in that. However, there’s some added complexity that comes as a result of having to convert between C and Python data types in function arguments. For the most part, all you need to do is call a simple ctypes helper function to make the conversion; e.g. c_double(42.3) will give you something that can be passed to a function that expects a C double as an argument.

In more recent versions of NumPy, ndarrays also have ctypes conversion methods (see the ndarray.ctypes.data_as() method; you probably want to get a pointer to the array, for which you should use ctypes.POINTER), although there are also some subtleties to be aware of in terms of how Python/NumPy and C/C++ or Fortran store multidimensional arrays in memory (see the order property of NumPy arrays, for example). The Sage website has a good tutorial on working with ndarrays in ctypes.

A slightly nicer way of handling type conversion is to write a small Python wrapper for the C function (or whatever), and use the argtypes attribute to specify what arguments the function accepts. A comprehensive tutorial can be found here, which explains most of the workings of ctypes, and which has good examples.
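As a flavour of how little boilerplate is involved, here’s a sketch that calls cos() from the system C maths library (the find_library lookup assumes a Unix-like system; on other platforms the library name may differ):

```python
import ctypes
import ctypes.util

# Locate and load the C maths library (libm on Linux)
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))
```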

The tutorial on the Python website mostly concerns calling existing system libraries, but of course you can write your own libraries too. For C/C++ code compiled with GCC, this is pretty simple – all you need to do is add a couple of compiler flags (see Section 3.4 of the TDLP Library Program tutorial). Of course, your C code should be set out like a library first – just compiling something that’s meant to be a standalone program as a library won’t work too well, and you’ll need to write header files etc. That’s easy to do, and existing “non-library” code can often be refactored into a working library with very little effort. There’s a very nice, simple tutorial on writing and compiling libraries here. Note that the libraries don’t need any “special treatment” to interface with Python using ctypes – they’re just standard C/Fortran libraries, and can happily be called by other C/Fortran programs too.

Python C API

Big chunks of Python are written in C, so it should come as no surprise that there is a C library which provides access to Python data types and functions. Access it by including Python.h in your C code. I’ll keep discussion of this one to a deliberately high-level, because it’s significantly more complicated than the others, and doesn’t seem to be fantastically well documented (although see the official Python docs for it here and here). This method allows you to write new Python modules directly in C; the end result will be something that looks and behaves exactly like a standard Python module (of the sort that you may have written before, in Python), which requires no fiddling about with data types or what have you in the Python code that uses it. That’s because all of the fiddling around is done in the C code! I find this quite a bit trickier to write – your C files need some boilerplate to get themselves (and their constituent functions) recognised by the Python interpreter, and there’s a bit of makefile magic required to actually build and install the module too. Plus, you have to figure out how to handle type conversions between native C types and Python types (mostly PyObjects). It’s not super-difficult, but it is fiddly in places. Take a look at this nice tutorial by Dan Foreman-Mackey (and another one from JPL which looks more specifically at handling NumPy arrays).

Some of the more confusing issues that you’re likely to run into are to do with reference counting. This is a reasonably technical computer science concept, which has to do with how Python stores and labels variables. (Recall that, in Python, the same data can be referred to by multiple variable names.) It’s important to make sure your C code properly tracks references to Python objects as it manipulates them; otherwise, you’re likely to run into a host of problems, from memory leaks to weird crashes. If you’re going to use the Python C API, I strongly recommend that you invest a good amount of time in understanding how this works.

All in all, I think the C API is only the way forward if you’re specifically setting out to write a new Python module for use elsewhere; if all you have is some existing C code that you want to quickly plug in to some Python code, it’s going to be a lot of hassle. Still, once you’ve set up a project once, and used the C API a little bit, it’s a lot quicker to get up and running with the next project.

Summary

All in all, ctypes is probably the best route to go for most people, unless you have something particularly trivial you want to speed up. It’s fast (native C function calls), easy enough to set up, and conceptually simple to understand. It’s not quite as “automatic” as Weave/Cython or f2py, but you’ll probably end up with cleaner, more flexible, and more robust code for your troubles. And it’s much easier to work with than the Python C API.

The Performance Python page has a nice comparison of some of the methods mentioned above. A tarball containing example code for most of the above methods can be downloaded from here; thanks to Ryan Houghton, Neale Gibson and Joe Zuntz for providing these.