Monthly Archives: August 2012

Getting codes in different languages to interact

In a Code Coffee meeting last month (9th July), we discussed ways of getting codes in different languages to talk to one another. This is quite a common task; for example, if you need your modern data analysis pipeline to interact with some legacy code (normally Fortran!), or if you have a mostly Python code that has a computationally-intensive bit that would be faster written in C. Our focus was on Python, C, and Fortran, since they seem to be the most commonly used languages in the group.

People brought a number of different solutions with them. Some were very easy to implement, just as a simple few-line block in your Python code that could be used to speed up a loop. Others were quite a bit more complicated, requiring extra work on building modules and getting your hands dirty with memory management, but having the virtue of being much more flexible. The methods we discussed are summarised below, in increasing order of complexity. If you skip to the bottom, there’s also a link to a tarball, with some examples.

(Edit: Be sure to check out some of the interesting comments, below, too. They mostly concern Cython.)

Weave

Weave is a way of writing C code “inline” – directly into a Python script.  Python variables are automatically made available to the C code (you sometimes need to do a bit of extra work to cast them to a specific type, but not much), which makes it nice and easy to use, with little boilerplate required in the C code. Essentially, you can just stick a fast, few-line C loop in the middle of a Python function, then call Weave and it’ll handle compilation and shuffling data between Python and C for you automagically, without requiring any complicated wrapper code or makefile magic. There’s also a “blitz” mode, which takes a Python/NumPy expression and converts it into C++ in the background, which can also help speed things up (although it’s generally slower than the inline mode).

Unfortunately, the documentation isn’t great, but there are plenty of examples out there (SageMath have a couple of very clear ones). It should definitely be your first port of call if all you want to do is speed up part of your Python code. It’s not so good if you want to interface with legacy code, or have something more complicated that you need to do in C (e.g. something that’s split into different functions). See also Cython, which seems to be a bit more flexible, but not too much more difficult to use. There’s a nice speed comparison with NumPy, Weave blitz and inline, and MATLAB here, and a Cython/Weave example (with benchmarks) here. There’s also an interesting blog post on getting GSL to work with Weave/Cython.

f2py

f2py is a way of automatically generating Python modules from Fortran code. It seems pretty easy to use – all you need to do is run it over your Fortran source code, and it will produce a compiled module that you can simply import into your Python code and use as you would any other. The f2py program scans the Fortran code and figures out how to deal with function arguments and convert between types itself, which is very handy. It even generates function documentation (docstrings) for the Python module, telling you what format the function arguments need to be in. One problem with f2py, pointed out by Ryan, is that it tends to fail silently, making it difficult to debug! It’s apparently quite robust though, so this probably won’t be necessary that often. The documentation looks pretty good, and you can find other resources here. (N.B. They ask you to cite their paper if you use f2py in research code.)

Python ctypes

ctypes is a Python module for handling calls to shared libraries (whether they are written in C/C++ or Fortran). The syntax for loading libraries and calling functions is exceptionally simple – little different from importing a native Python module and calling functions in that. However, there’s some added complexity that comes as a result of having to convert between C and Python data types in function arguments. For the most part, all you need to do is call a simple ctypes helper function to make the conversion; e.g. c_double(42.3) will give you something that can be passed to a function that expects a C double as an argument. In more recent versions of NumPy, ndarrays also have ctypes conversion methods (see the ndarray.ctypes.data_as() method; you probably want to get a pointer to the array, for which you should use ctypes.POINTER), although there are also some subtleties to be aware of in terms of how Python/NumPy and C/C++ or Fortran store multidimensional arrays in memory (see the order property of NumPy arrays, for example). The Sage website has a good tutorial on working with ndarrays in ctypes. A slightly nicer way of handling types conversion is to write a small Python wrapper for the C function (or whatever), and use the argtypes attribute to specify what arguments the function accepts. A comprehensive tutorial can be found here, which explains most of the workings of ctypes, and which has good examples.

The tutorial on the Python website mostly concerns calling existing system libraries, but of course you can write your own libraries too. For C/C++ code compiled with GCC, this is pretty simple – all you need to do is add a couple of compiler flags (see Section 3.4 of the TDLP Library Program tutorial). Of course, your C code should be set out like a library first – just compiling something that’s meant to be a standalone program as a library won’t work too well, and you’ll need to write header files etc. That’s easy to do, and existing “non-library” code can often be refactored into a working library with very little effort. There’s a very nice, simple tutorial on writing and compiling libraries here. Note that the libraries don’t need any “special treatment” to interface with Python using ctypes – they’re just standard C/Fortran libraries, and can happily be called by other C/Fortran programs too.

Python C API

Big chunks of Python are written in C, so it should come as no surprise that there is a C library which provides access to Python data types and functions. Access it by including Python.h in your C code. I’ll keep discussion of this one to a deliberately high-level, because it’s significantly more complicated than the others, and doesn’t seem to be fantastically well documented (although see the official Python docs for it here and here). This method allows you to write new Python modules directly in C; the end result will be something that looks and behaves exactly like a standard Python module (of the sort that you may have written before, in Python), which requires no fiddling about with data types or what have you in the Python code that uses it. That’s because all of the fiddling around is done in the C code! I find this quite a bit trickier to write – your C files need some boilerplate to get themselves (and their constituent functions) recognised by the Python interpreter, and there’s a bit of makefile magic required to actually build and install the module too. Plus, you have to figure out how to handle type conversions between native C types and Python types (mostly PyObjects). It’s not super-difficult, but it is fiddly in places. Take a look at this nice tutorial by Dan Foreman-Mackey (and another one from JPL which looks more specifically at handling NumPy arrays).

Some of the more confusing issues that you’re likely to run into are to do with reference counting. This is a reasonably technical computer science concept, which has to do with how Python stores and labels variables. (Recall that, in Python, the same data can be referred to by multiple variable names.) It’s important to make sure your C code properly tracks references to Python objects as it manipulates them; otherwise, you’re likely to run into a host of problems, from memory leaks to weird crashes. If you’re going to use the Python C API, I strongly recommend that you invest a good amount of time in understanding how this works.

All in all, I think the C API is only the way forward if you’re specifically setting out to write a new Python module for use elsewhere; if all you have is some existing C code that you want to quickly plug in to some Python code, it’s going to be a lot of hassle. Still, once you’ve set up a project once, and used the C API a little bit, it’s a lot quicker to get up and running with the next project.

Summary

All in all, ctypes is probably the best route to go for most people, unless you have something particularly trivial you want to speed-up. It’s fast (native C function calls), easy enough to set up, and conceptually simple to understand. It’s not quite as “automatic” as Weave/Cython or f2py, but you’ll probably end up with cleaner, more flexible, and more robust code for your troubles. And it’s much easier to work with than the Python C API.

The Performance Python page has a nice comparison of some of the methods mentioned above. A tarball containing example code for most of the above methods can be downloaded from here; thanks to Ryan Houghton, Neale Gibson and Joe Zuntz for providing these.

Advertisements

SDSS DR9 data release is out

The much-anticipated (by me, anyway) SDSS-III DR9 data release is out! Like many other people, I love mucking around with big datasets like this, and the SDSS team have done a really nice job of trying to make the data as accessible as possible. An integral part of any serious public data release is documentation, and the SDSS crew have excelled at that too, from what I’ve seen so far: check out the various interesting tutorials and basic code snippets that will help you to get up and running.