
Optimisations, take two

I have talked about optimisations before and I mentioned that there are a few left to describe. In this post I will talk about two of them.

The first optimisation I would like to talk about concerns compilation flags. Somehow, everyone seems obsessed with them, as if they could somehow make a slow program fast. Alas, this is not the case. Firstly, and this is quite funny: the gcc flag “-O3” usually turns on all the extra flags that people tend to pass by hand. One should really ponder upon this when looking at desperate attempts to speed up code. However, there is exactly one flag combination that is very useful besides “-O3”: “-fprofile-generate” followed by “-fprofile-use”. To understand why these are useful, we must first understand that one of the challenges faced by an optimising compiler is guessing how many times a loop will be executed and how many times a branch will be taken. Given a good set of guesses, a loop can be unrolled (or not) and a branch can be predicted taken by default (or not). If compiled with “-fprofile-generate”, the program gathers this information on the fly, and the recorded data can later be used by “-fprofile-use”. The speedup achieved with this approach in the realm of DPLL-based SAT solvers is relatively high, on the order of 5-10%. I believe many SAT solvers’ binaries don’t use this trick, even though it’s cheap to apply. I personally compile with “-fprofile-generate”, then run an example problem like UTI-20-10p1, and then recompile with “-fprofile-use”. An important aspect, by the way, is to profile on a very typical scenario: strange or unusual inputs will produce spurious branching and loop data, which will most of the time actually slow the program down instead of speeding it up.
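To make this concrete, the whole cycle looks roughly like this. This is only a minimal sketch: the file name solver.cpp, the binary name solver and the .cnf extension are placeholders, not an actual build setup; the flags themselves are real gcc flags.

g++ -O3 -fprofile-generate solver.cpp -o solver   # instrumented build
./solver UTI-20-10p1.cnf                          # run a typical instance to record profile data
g++ -O3 -fprofile-use solver.cpp -o solver        # rebuild, letting gcc use the recorded profile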

The second optimisation I would like to talk about concerns cache usage. Modern processors are extremely fast, to the point that what often limits them is no longer the ability to execute code on the data, but the ability to actually get the data the code needs. This problem is usually alleviated through spatial and temporal data locality: the data most likely needed next is usually physically close to the data we have just accessed. However, in DPLL-based SAT solvers, the main function, propagate(), goes through a list of pointers, accessing the data each pointer points to. In other words, the CPU will fetch the next pointer in the background, but will not fetch what that pointer points to, simply because it doesn’t know it’s a pointer. So, how could we alleviate this problem? Well, by telling the CPU that the next piece of information we will need is the data the next pointer points to. Like this:

for (size_t i = 0; i < pointers.size(); i++) {
    // Prefetch the data behind the *next* pointer while we
    // treat the data behind the current one
    if (i + 1 < pointers.size())
        __builtin_prefetch(pointers[i+1]);
    treat_data(*pointers[i]);
}

where pointers is a vector of pointers, each of which must be treated. The function __builtin_prefetch() emits an assembly instruction that fetches the pointed-to data into the Level 1 cache of the processor without blocking. These instructions are not implemented on all CPUs, but many x86 and all AMD64 architectures have them. Essentially, while we are treating the data pointers[i] points at, we fetch the data pointers[i+1] points at, in parallel. Interleaving the fetching of the next piece of data with the treatment of the current one lets us do more in less time, speeding up the program. Interestingly, since prefetching does not alter the flow of the program in any way, omitting the prefetch call does not alter the results at all: it just makes the program run slower. In CryptoMiniSat I use this optimisation, and it seems to speed up solving by a few percent, maybe 5-8%. How much time you save depends on the locality and size of the data your pointers point to. If the data is very small, e.g. 2-3MB in total, and the pieces are close to each other, then the CPU will have fetched all of it after the first couple of pointers have been dereferenced, so there is no need for explicit prefetching. But if the data you are dealing with is much larger, which is often the case with SAT solvers, then such explicit prefetching can save quite some time.

On failed literal probing

Apparently, quite a lot of work has been done on failed literal probing, although I don’t think it has all been put into one common package yet.

The idea is quite simple in its purest form: try setting variable v1 to true and see if that fails. If it does, v1 must be false. If it doesn’t fail, try the other way around: set it to false, and if that fails, it must be true.
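In code, this basic check might look roughly as follows. This is only a sketch against a hypothetical MiniSat-style interface: the Solver class with newDecisionLevel(), enqueue(), propagate() and cancelUntil(), and the Var and Lit(var, negated) types, are assumptions for illustration, not CryptoMiniSat’s actual API.

// Probe variable v. Returns true if the probe forced v to a value.
bool probe(Solver& solver, Var v) {
    // Assume v = true at a fresh decision level and propagate
    solver.newDecisionLevel();
    solver.enqueue(Lit(v, false));
    bool conflict = !solver.propagate();  // assumed: propagate() returns false on conflict
    solver.cancelUntil(0);                // undo the probe
    if (conflict) {
        // v = true failed, so v must be false: fix it at decision level 0
        solver.enqueue(Lit(v, true));
        solver.propagate();
        return true;
    }

    // Now try the other way around: assume v = false
    solver.newDecisionLevel();
    solver.enqueue(Lit(v, true));
    conflict = !solver.propagate();
    solver.cancelUntil(0);
    if (conflict) {
        // v = false failed, so v must be true
        solver.enqueue(Lit(v, false));
        solver.propagate();
        return true;
    }
    return false;  // neither polarity failed; v stays free
}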

There are a couple of tricks we can add, however. For example, if both v1 and !v1 set variable v10 to value X (where X can be either true or false), then set v10 to X. The reasoning here is simple: v1 must take one of the two values, so whatever both v1 and !v1 imply must hold, too. So we can safely set v10 to X.
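One way to implement this, continuing the sketch above: propagateAndCollect is a hypothetical helper that propagates a literal at a fresh decision level, records which variable got assigned which value, and then backtracks. None of these names come from an actual solver.

#include <map>

// Hypothetical helper: propagate 'lit' at a new decision level, collect
// the (variable, value) pairs assigned as a consequence, then backtrack.
std::map<Var, bool> propagateAndCollect(Solver& solver, Lit lit);

void probeBothPolarities(Solver& solver, Var v1) {
    std::map<Var, bool> afterTrue  = propagateAndCollect(solver, Lit(v1, false));
    std::map<Var, bool> afterFalse = propagateAndCollect(solver, Lit(v1, true));

    for (const auto& assign : afterTrue) {
        auto other = afterFalse.find(assign.first);
        if (other == afterFalse.end())
            continue;
        if (other->second == assign.second) {
            // Both v1 and !v1 force this variable to the same value X,
            // so X holds regardless of v1: fix it at decision level 0.
            solver.enqueue(Lit(assign.first, !assign.second));
        }
    }
}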

A more interesting thought is the following. If v1 sets v20 to true, but !v1 sets v20 to false, then v1 = v20. So we can replace v20 with v1. One less variable to worry about!
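In the comparison loop of probeBothPolarities above, this is simply the remaining case: the two propagations disagree on the variable’s value. A possible continuation, with markEquivalent being another hypothetical helper that records the replacement for later execution:

        } else {
            // v1 forces this variable to X, while !v1 forces it to !X:
            // the variable always follows v1 (possibly inverted), so it
            // can be replaced by v1 everywhere.
            markEquivalent(v1, assign.first, /*inverted=*/!assign.second);
        }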

There are even more tricks, however. If setting v1 to true and setting it to false both shorten a longer XOR to the same 2-long XOR “v40 + v50 = false”, this 2-long XOR can be learnt: v40 can be replaced with v50. For example, if v1 = true propagates v2 = true and v3 = true, while v1 = false propagates v2 = false and v3 = false, then the XOR “v2 + v3 + v40 + v50 = false” shortens to “v40 + v50 = false” in both cases, so v40 = v50.

And more tricks. If there is a 2-long clause v1 or v2 = true, then we can do all of the above, but with v1 and v2 this time. Since either v1 or v2 must be true, all the above ideas still work: in all the previous ideas, all we used was the fact that either v1 or !v1 must be true, and this is still the case, since either v1 or v2 must be true. This, by the way, is called 1-recursive learning. (Note: the trick in the fourth paragraph changes a bit, but we still learn a 2-long XOR.)
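Reusing the hypothetical propagateAndCollect helper from above, the binary-clause variant of the first trick might look like this; the clause’s two literals play the role that v1 and !v1 played before.

// For a binary clause (a OR b): at least one of the two literals must be
// true, so anything implied by both a and b holds unconditionally.
void probeBinaryClause(Solver& solver, Lit a, Lit b) {
    std::map<Var, bool> afterA = propagateAndCollect(solver, a);
    std::map<Var, bool> afterB = propagateAndCollect(solver, b);

    for (const auto& assign : afterA) {
        auto other = afterB.find(assign.first);
        if (other != afterB.end() && other->second == assign.second) {
            // Forced to the same value whichever of a, b is true:
            // safe to fix at decision level 0.
            solver.enqueue(Lit(assign.first, !assign.second));
        }
    }
}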

And now something new. I have been thinking about trying to re-use old learnt clauses. They sometimes encode quite important facts about the problem. What if we tried to re-use them? After all, keeping them in memory is really cheap: the only problem with them is that doing regular propagation with them takes a lot of time, thus slowing down the solver. But what if we only used them for failed literal probing? I have just launched a test on the Grid’5000 cluster to find out. Fingers crossed…