Optimisations, take two

I have talked about optimisations before and I mentioned that there are a few left to describe. In this post I will talk about two of them.

The first optimisation I would like to talk about concerns compilation flags. Somehow, everyone seems obsessed with them, like if they could somehow make a slow program fast. Alas, this is not the case. Firstly, and this is quite funny: the gcc flag “-O3” usually turns on all extra flags that people tend to give. One should really ponder upon this when looking at desperate attempts to speed up code. However, there is exactly one flag-combo that is very-very useful besides “-O3”: it’s “-fprofile-generate” and “-fprofile-use”. To understand why these are useful, we must first understand that one of the challenges faced by an optimising compiler is to try to guess how many times a loop will be executed,and how many times a branch will be taken. Given a good set of guesses, the loop can be unwound (or not) and the branches can be taken by default (or not). If compiled with “-fprofile-generate”, the program generates such information on-the-fly, which later can be used by “-fprofile-use”. The speedup achieved with such an approach in the realms of DPLL-based SAT solvers is relatively high, in the order of ~5-10%. I believe many SAT solvers’ binaries don’t use this trick, even though it’s cheap to use. I personally compile with “-fprofile-generate”, then run an example problem like UTI-20-10p1 on it, and then recompile with “-fprofile-use”. An important aspect, by the way, is to execute your problem with a very typical scenario: strange or non-normal scenarios will produce spurious branching and loop data which will actually slow down the program most of the time instead of speeding it up.

The second optimisation I would like to talk about concerns cache usage. Modern processors are extremely fast, to the point that many times what really limits the processor is no longer the ability to execute the code on the data, but to actually get the data that the code needs. This problem is usually alleviated through spatial and temporal data locality that is, that data most likely needed next is usually physically close to the one that we just accessed. However, in DPLL-based SAT solvers, the main function, propagate() goes through a list of pointers, accessing the data where each pointer points to. In other words, the CPU will fetch the next pointer in the background, but will not fetch where that pointer points to — simply because it doesn’t know it’s a pointer. So, how could we alleviate this problem? Well, by telling the CPU that the next piece of information that will be needed is where the next pointer points to. Like this:

where pointers is a vector of pointers that must each be treated. The function __builtin_prefetch() calls an assembly instruction that non-blockingly fetches the data pointed to into the Level 1 cache of the processor. This/these instruction(s) are not implemented in all CPUs, but many x86 and all AMD64 architectures have them. Essentially, while we are dealing with data where pointers[i] points at, we fetch the data where pointers[i+1] points at, in parallel. Interleaving the fetching of next data with the treatment of current data helps us do more things in less time, speeding up the program. Interestingly, since prefetching the data does not alter the flow of the program in any way, omitting the prefetch call does not alter the results of the program at all: it just makes it run slower. In CryptoMiniSat I use this optimisation, and it seems to speed up the solving by a couple of percentages, maybe 5-8%. The amount of time you will save on this depends on the locality and size of the information your pointers are pointing to. If they are very small, e.g. 2-3MB in total and are close to each other, then the CPU will fetch these after the first couple of pointers have been dereferenced, so there is no need to do explicit prefetching. But if the data you are dealing with is much larger, which is often the case with SAT solvers, then such explicit pre-fetching can save quite some time.