Tag Archives: computing

Book Review: How Software Works

Another month, another book kindly sent for review from No Starch. This one’s of the more conceptual variety, and sets out to explain – in layman’s terms – the algorithms that are responsible for much of the “magic” of modern technology.


When I was younger, and just getting into computers, I used to spend hours reading second-hand software and hardware manuals. I think I must have read the manual for my old computer’s motherboard 50 times. A kindly network engineer from the PC Pro forums (ah, those were the days) sent me an old 400-page networking manual that I inhaled too. Manuals were the best. Manuals showed me what computers were capable of.

It wasn’t until later that I became aware of where the real action is. Algorithms. Manuals are a fine thing, but they’re typically written at a high level. They tell you what’s happening – and which buttons to press – but, for expediency, often skip over the “how” – the way the mysterious feats that first got me fired up about computers are actually achieved.

This book, How Software Works (by V. Anton Spraul), is all about the “how”. You won’t find any practical manual-type information in here at all, so don’t expect to come out the other side of this book with a finely-honed knowledge of printer troubleshooting or anything like that. No, this is a very pure book that explains, in uncompromisingly non-technical terms, how computers achieve their magic.

Each chapter covers a broad but real-world-relevant topic, such as web security, movie CGI, or mapping. After some background on each topic, Spraul sketches out the most important pieces of the algorithmic puzzle needed to produce the “everyday” results we now take for granted in movies, on the web, and in our smartphone apps. This might include a walkthrough of the logic behind a trapdoor function, of the sort that makes public key encryption possible (which in turn makes internet shopping practical). Or perhaps the step-by-step process by which a rendering program builds up a realistic virtual scene in a movie, through ray tracing.
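For a flavour of that trapdoor logic (this toy sketch is mine, not the book’s, and uses deliberately tiny numbers): modular exponentiation is cheap to compute forwards, but reversing it means a brute-force search.

```python
# A toy illustration of a trapdoor (one-way) function: modular
# exponentiation is easy in the forward direction, but reversing it
# (the "discrete logarithm") has no known shortcut. Real systems use
# numbers hundreds of digits long, not these tiny illustrative ones.
base, modulus = 5, 23

def forward(secret):
    # Easy direction: (base ** secret) % modulus, computed efficiently.
    return pow(base, secret, modulus)

def reverse(value):
    # Hard direction: without a trapdoor, all we can do is try every
    # exponent in turn, which is infeasible at real key sizes.
    for exponent in range(modulus):
        if forward(exponent) == value:
            return exponent
```

With base 5 and modulus 23, forward(6) gives 8 almost instantly, while getting back from the 8 to the 6 takes an exhaustive search – and at realistic sizes that asymmetry is what the security rests on.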

The writing is very clear and non-technical, almost without exception, and assumes very little prior knowledge. You do not need a technical background to understand this book, but you’ll want to spend some time following the examples and ruminating on them a little to really get everything. The examples themselves are plentiful, and include step-by-step illustrations of simplified situations that, when linked together, demonstrate how each algorithm works as a whole.

Given that this is a book on software, it’s slightly disappointing that the presentation is completely “dead-tree traditional”. By this, I mean that there’s no supplementary material in the form of working code snippets that one could play with, or interactive demonstrations. This feels like a missed opportunity, at least for those of us who learn best by tinkering (cf. the excellent W3Schools “Try It Yourself” tutorials). It’d also turn the book into a more direct educational tool, perhaps something that a class could be based on – and there are enough simple web-based programming systems out there to remove much of the burden of having to “teach” programming in the first place. This is more of a wishlist item than a crucially missing piece, however.

Another minor criticism is the length of the book. It would have been nice to see a few more topics covered, or perhaps a little more detail in the final chapters. The material on searching could go into more detail in explaining how web search works, for example, including things like how robots/crawlers and ranking algorithms (e.g. PageRank) actually do their thing. As it is, it feels like the author ran out of steam before getting to the real crux of this topic.

All in all, it’s a very nice book, and I learned a lot about some interesting, highly-relevant techniques that I was only dimly aware existed. The material on encryption in particular outlines a clever and essentially mathematical topic that will speak to those of you who enjoy logic puzzles, for example. I’m not quite sure who the intended audience is for the book as a whole, but it’s definitely something to keep in mind for an aspiring techie – a teenager who’s still reading the manuals, perhaps, and is ready to have their horizons broadened. The mechanically-minded, those with a fundamental curiosity about how things work, will also enjoy it.


Book Review: How Linux Works 2

I received a review copy of How Linux Works 2, by Brian Ward, from the lovely folks at No Starch Press (my publisher) late last year. Inexcusably, it’s taken me until now to put together a proper review; here it is, with profuse apologies for the delay!


How Linux Works 2 is a very nice technical read. I’ve been a user and administrator of Linux systems for over a decade now, and can safely say I learned a lot of new stuff from both angles. Newer users will probably get even more from it – although absolute beginners with less of a technical bent might be better off looking elsewhere.

The book fills something of a niche; it’s not a standard manual-type offering, nor is it a technical system reference. It’s more impressionistic than either of those, written as a sort of overview of the organisation and concepts that go into a generic Linux system, although with specific details scattered throughout that really get into the nuts and bolts of things. If you’re looking for “how-to”-type instructions, you’re unlikely to find everything you need here, and it isn’t a comprehensive reference guide either. But if you’re technically-minded and want to understand the essentials of how most Linux distros work in considerable (but not absolute) depth, with a bit of getting your hands dirty, then it’s a great book to have on your shelf.

Various technical concepts are covered ably and concisely, and I was left with a much better feeling for the more mysterious Linux components – like the networking subsystem – than I had before. There are practical details here as well though, and you’ll find brief, high-level overviews of a number of useful commands and utilities that are sufficient to give a flavour of what they’re good for without getting too caught up in the (often idiosyncratic) specifics of their usage.

That said, the author does sometimes slip into “how-to” mode, giving more details about how to use certain tools. While this is fine in moderation, the choice of digression is sometimes unusual – for example, file sharing with Samba is awarded a whole six pages (and ten subsections) of usage-specifics, while the arguably more fundamental CUPS printing subsystem has to make do with less than two pages. The discussion of SSH is also quite limited, despite the importance of this tool from both the user’s and administrator’s perspective, and desktop environments probably could have done with a bit more than a brief single-chapter overview. Still, this book really isn’t intended as a manual, and the author has done well not to stray too far in this direction.

A common difficulty for Linux books is the great deal of variation between distros. Authors often struggle with where to draw the line between complete (but superficial) distro-agnostic generality and more useful, but audience-limiting, distro specifics. How Linux Works succeeds admirably in walking this tightrope, providing sufficient detail to be useful to users of more or less any Linux system without repeatedly dropping into tiresome list-like “distro by distro” discussions. This isn’t always successful – the preponderance of init systems in modern distros has necessitated a long and somewhat dull enumeration of three of the most common options, for example – but HLW2 does much better at handling this than most books I’ve seen. The upshot is that the writing is fluid and interesting for the most part, without too many of the “painful but necessary” digressions that plague technical writing.

Overall, this book is an enjoyable and informative read for anyone interested in, well, how Linux works! You’ll get an essential understanding of what’s going on under the hood without getting bogged down in minutiae – making this a very refreshing (and wholly recommended) addition to the Linux literature.

You can find a sample chapter and table of contents/index on the No Starch website.

Discrete probability integral transform for large N

You may or may not have heard of the probability integral transform, but it’s got to be up there as one of my favourite integral transforms (yeah – I have favourites). The basic idea is that you want a sample of random variables from some non-standard probability distribution, but all you have is a basic random number generator that spits out samples from the uniform distribution. With a bit of magic involving the cumulative distribution function (cdf) of your target distribution, and an inversion, you can turn that uniform sample into a sample in the target distribution. This is incredibly useful. (Check out Martin Hendry’s excellent SUPA data analysis course, section 6, for a nice introduction to the probability integral transform.)

This requires knowledge of the pdf of your target distribution, which may be non-trivial to evaluate (hint, hint: in large-scale structure calculations). In particular, there are lots of situations where you will only have a discrete representation of the pdf, sampled over some set of intervals. That’s not such a problem, though, because you can just approximate the integral transform as a sum.
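As a minimal sketch of what that sum looks like in practice (the grid and pdf here are purely illustrative – a toy Gaussian, not anything from a real calculation):

```python
import numpy as np

# Hypothetical discretely-sampled target pdf on a grid of x values.
x_grid = np.linspace(-5.0, 5.0, 1000)
pdf = np.exp(-0.5 * x_grid**2)  # toy unnormalised Gaussian

# Approximate the integral transform as a cumulative sum, then
# normalise so the cdf runs over the interval [0, 1].
cdf = np.cumsum(pdf)
cdf /= cdf[-1]
```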

One thing that is a problem, however, is the inversion part of the process. Because of the non-analytic nature of the problem, you’ll probably have to do some sort of nearest-neighbour search to do the inversion. (Essentially, given some value, you want to find the index of the entry in y that lies closest to that value, and then retrieve the entry at that same index from a corresponding array x, where y = f(x).) If you want to draw a large number of values from the pdf, or if you’ve produced a lot of samples of the pdf, this can be computationally intensive.

argmin: The simple way

One solution to the problem is to use NumPy’s argmin() function, like so:

import numpy as np

idxs = [(np.abs(cdf - x)).argmin() for x in X]

Here, cdf is the (discretely-sampled, normalised to interval [0, 1]) cumulative distribution function, and X is an array of draws from the Uniform distribution. The output is an array of indices, which can then be used to find the y corresponding to each x in X, like so:

y_samp = Y[idxs]

Unfortunately, this method doesn’t scale very well with the length of X or the number of samples in the cdf. It’s fine for small samples, though.

KDTree: The sophisticated way

If you do have large sample sizes, argmin() won’t cut it. Fortunately, SciPy has a nice C implementation of KD-tree nearest-neighbour search, which can be found in its spatial module. (There’s also a pure Python implementation, but that’s a lot slower for large samples.) All you need to do is build a tree, and then query it with your Uniform sample, X. For large samples, the speed-up when compared to the argmin() method, above, is breathtaking. Here’s some example code:

import scipy.spatial

tree = scipy.spatial.cKDTree( cdf.reshape((cdf.size, 1)) )
dists, idxs = tree.query( X.reshape((X.size, 1)) )

These functions require a bit of reshaping of the (1D) input arrays to get things right (the KDTree algorithm is designed to efficiently deal with k-dimensional data – hence the “K-D” part of its name – and so expects multi-dimensional arrays), but that can be done inexpensively. The query() method returns the distances between the input values and the nearest neighbours too, at no extra cost in computation.

What you’re left with is a super-fast nearest-neighbour search that you could even extend to more dimensions (if you had a multivariate pdf, for example). This makes the probability integral transform much cheaper computationally.
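Putting the pieces together, a self-contained sketch of the whole procedure might look like the following (again with a toy Gaussian standing in for the target pdf, and illustrative variable names):

```python
import numpy as np
import scipy.spatial

# Toy (unnormalised) Gaussian target pdf, discretely sampled on a grid.
x_grid = np.linspace(-5.0, 5.0, 2000)
pdf = np.exp(-0.5 * x_grid**2)

# Approximate the integral transform as a cumulative sum, normalised
# so the cdf covers the interval [0, 1].
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

# Build a KD-tree over the cdf values (reshaped to 2D, since cKDTree
# expects k-dimensional points), then query it with uniform draws.
tree = scipy.spatial.cKDTree(cdf.reshape((cdf.size, 1)))
X = np.random.uniform(size=100000)
dists, idxs = tree.query(X.reshape((X.size, 1)))

# Map the matched indices back onto the x grid: these are (approximate)
# draws from the target distribution.
y_samp = x_grid[idxs]
```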

Code Coffee: Parallelisation with OpenMP

Rich Booth gave us an introduction to using OpenMP to parallelise our code. It turns out to be surprisingly easy – all you need to do is add a specially-formed comment here and there, and OpenMP will do the rest. At its most basic, OpenMP just takes a serial code (code meant to be run on a single processor), and splits the work between multiple threads, which can be run on multiple processors. This (hopefully) speeds up code execution by sharing the load. Rich gave examples for OpenMP in C and Fortran, but there are other parallelisation tools out there for other languages, like the multiprocessing module in Python. It was also mentioned that Matlab has excellent multi-processing capabilities, which tend to be quite easy to use.

Rich worked off this introduction to OpenMP, and showed us some basic examples of how for loops can be parallelised. He also discussed concepts like:

  • Scheduling: how the workload should be split up between different threads to make things as fast as possible (different scheduling strategies are available)
  • Scope: which variables should be private to a thread, which should be shared between all threads, and how to specify which is which
  • Preventing race conditions: ensuring that shared variables are updated in a coherent way by individual threads, e.g. by using atomic operations
  • Functions: Bundling code into a function which is then called from inside the parallelised block of code, to make it easier to keep track of private and shared variables
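
Rich’s examples were in C and Fortran, but the same divide-up-a-loop idea can be sketched with Python’s multiprocessing module, mentioned above (the work function here is just a stand-in for a computationally-intensive loop body):

```python
import multiprocessing

def work(i):
    # Stand-in for a computationally-intensive loop body; here it just
    # sums the first i integers.
    return sum(range(i))

if __name__ == "__main__":
    # Serial version: results = [work(i) for i in range(1000)]
    # Parallel version: a pool of worker processes shares out the
    # iterations, much as OpenMP schedules the iterations of a
    # parallel for loop across threads.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(work, range(1000))
```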

Other topics that were touched on include the difficulty of debugging parallelised code (the best way is to avoid having to debug by following a few good practices), the fact that OpenMP-enabled code can be run in “serial mode” by compiling with a standard compiler instead (the compiler will just throw up a few spurious warnings when it gets to the OpenMP bits), and that it’s best to only use OpenMP to parallelise a few well-defined tasks, like computationally-intensive for loops, rather than trying to parallelise everything (which can take a long time to code properly).

All in all, OpenMP looks like a nice, relatively stable way of speeding up operations that can be parallelised. It’s a lot simpler to implement than I thought, and doesn’t seem to require a load of code rewrites or anything as serious as that. There’s quite a bit of introductory tutorial material on the web, along with a few blogs dedicated to parallel programming. As usual, Wikipedia is helpful on concepts.

Code Coffee

Joe Zuntz and I recently started running an informal discussion group on computational techniques in astrophysics here in Oxford (“Code Coffee”). Pretty much all of the astrophysicists I know spend the majority of their time fighting with computers, so we thought it would be a nice way to get people together to collectively solve problems and share experience. It’s also a good way of learning about interesting techniques – astros tend not to have too much in the way of a formal computing education, so talking about even basic techniques can be valuable.

The format of the sessions is as follows: We start off with a brief, informal presentation on a computing topic, by a member of the department who has some interest or expertise in it. This is supposed to last five or ten minutes but almost invariably runs over, as people ask questions and try to figure out how best to apply what they’re learning to their own problems. I think this is proving to be quite handy to many of the attendees – there are lots of easy ways of improving your code, if only you know about them. We then move on to discussion or “show and tell”, where we use the collective experience in the room to help solve problems, discuss good practices, and so on. We also shovel biscuits down everyone’s throats.

So, that’s the idea behind it. After each Code Coffee, I’m going to post a brief summary of the discussion on this blog, along with web links to useful resources, mostly for the benefit of people who attended the sessions. Feel free to contribute in the comments, too.

Colours in the Mac OS X terminal

My snobbish conviction that Mac OS X is somehow an inferior operating system has been growing of late. I long to install Ubuntu 12.04 on my Power Mac at work (centrally managed by Oxford Physics IT services), just so that I don’t have to recompile every bloody component of my toolchain every time I start a new project.

Anyway, for evidence that Mac OS X isn’t a “proper” operating system, note the lack of colours in the default terminal profile. Microsoft Windows (vehemently not a proper operating system) also lacks colours on the command line by default. You do the maths.

To get colours in the terminal on a Mac, add the following to your ~/.bashrc:

export CLICOLOR=1