Asymmetric branching

Asymmetric branching is an algorithm that shortens CNF clauses in SAT Solvers. A clause, for instance ` v1 V v2 V v3 V v4 ` (where letters are binary variables and `V` represents binary OR) could possibly be shortened to `v1 V v2 V v3`. To find out if it can be, all we have to do is to put `-v1,-v2` and `-v3` into the propagation queue and then propagate. If we receive a conflict from the propagation engine, we can learn the clause `v1 V v2 V v3`, which (incidentally, though this was the point), subsumes the original clause, so we can simply remove variable `v4` from the original clause.

Ok, so much for theory. Now comes the hard part: how do we do this such that it actually speeds up the solving? The problem is that asymmetric branching, when done on all possible clauses, is slow. However, its benefits could be large, since a shortened clause naturally leads a faster propagation and shorter resolution proofs thus less propagation need.

I have been experimenting in getting some benefit from asymmetric branching, and now it works extremely well for CryptoMiniSat. The trick I use, is that I first sort the clauses according to size, and only try to shorten with asymmetric branching the top couple of clauses. This ensures that the largest clauses are shortened first. Since the largest clauses contribute most the to size of the resolution proof and they are the slowest to propagate, this makes sense.

CryptoMiniSat tries to do asymmetric branching regularly, always for only a little while (~2 seconds). I believe this is useful, because as the amount of time the program has been trying to solve a problem increases, it makes sense that we have a bit more time to do things that could help resolve the problem faster. For instance, if CryptoMiniSat has been trying to solve a given problem for 30 minutes unsuccessfully with the standard clause-learning DPLL procedure in vain, we can allocate 2-3 seconds to possibly gain ~5-10% later. In the example case it can be assumed that since we haven’t been able to solve it for 30 minutes, probably we won’t solve it in the next 10 minutes, so gaining 5% on 10 minutes is 30 seconds, far more than the 2-3 seconds we invested.

The results with asymmetric branching with CryptoMiniSat are quite astounding. Using the 2009 SAT Competition benchmark set with an approximately correct timeout, CryptoMiniSat could normally solve 217 problems. With asymmetric branching, CryptoMiniSat can now solve 220 — a huge increase: last year, 16 solvers running in parallel could only solve 229 instances in total.

Edited to add (26/09/2010): Clause Vivification by Piette, Hamadi and Sais is a paper about the above described method, though with a number of key differences. The paper seems to advocate for a complete procedure, furthermore, it calls the conflict generation routine in certain cases. I believe that the above described way of carrying out this technique brings more tangible benefits, especially for larger problems.

(Updated: we need to branch on the inverse of the clause’s literals. Thanks to Vegard Nossum for spotting this)

Propagating faster

Propagation (BCP) in SAT is the single most expensive operation. The algorithm essentially does the following:

```while (queue_head < trail.size()) {
vector<Watch>& thiswatch = watches[p];
for (i = thiswatch.begin(); i!=thiswatch[p].end(); i++) {
confl = prop_confl(i->clause);
if (confl != NULL) return confl;
}
}```

where the function `prop_confl()` executes the propagation or returns a conflict, if necessary.

The problem with the above function is that clauses come in all shapes and sizes. In CryptoMiniSat, there are two kinds of clauses, normal clauses and XOR clauses. Naturally, these need to be handled differently. Furthermore, if the clause’s size is small, it is more efficient to handle it differently. The way CryptoMiniSat has done this before was to have a separate “`watches`” for each of these. As you can guess, this is very suboptimal.

The new code in CryptoMiniSat now uses a different approach. Instead of having a `watches_normal, watches_xor, watches_binary`, and a `for` loop for each, we have one `watches` and one `for` loop, but this time, the struct `Watch` encodes the type of the watched clause. So our new BCP function is:

```while (queue_head < trail.size()) {
vector<Watch>& thiswatch = watches[p];
for (i = thiswatch.begin(); i!=thiswatch.end(); i++) {
if (i->isBinary())
confl = prop_confl_bin(i->clause);
if (i->isNormal())
confl = prop_confl_norm(i->clause);
if (i->isXor())
confl = prop_confl_xor(i->clause);
if (confl != NULL) return confl;
}
}```

This new method reduces the jump-around that the processor otherwise would have done with all those separate for loops. It has an important drawback, though: it is now more complicated (and more expensive) to fully propagate binary clauses before any other clause. The original code had this:

```while (queue_head < trail.size()) {
vector<Watch>& thiswatch_bin = watches_bin[p];
for (i = thiswatch_bin.begin(); i!=thiswatch_bin.end(); i++) {
confl = prop_confl_bin(i->clause);
if (confl != NULL) return confl;
}
}
vector<Watch>& thiswatch = watches[p];
for (i = thiswatch.begin(); i!=thiswatch.end(); i++) {
confl = prop_confl(i->clause);
if (confl != NULL) return confl;
}
}```

Since there are no separate watchlists for binary clauses anymore, we are required to go through all clauses all the time in the first for loop, or we can simply skip the first loop. The disadvantage of this is that some variables could have possibly been set by binary clauses may now be set by non-binary clauses, which leads to a possibly increased number of literals in the resulting learnt clause — since the incoming arcs into the conflict will likely contain more clauses that are larger than 2-long.

Optimisations, take two

I have talked about optimisations before and I mentioned that there are a few left to describe. In this post I will talk about two of them.

The first optimisation I would like to talk about concerns compilation flags. Somehow, everyone seems obsessed with them, like if they could somehow make a slow program fast. Alas, this is not the case. Firstly, and this is quite funny: the gcc flag “-O3” usually turns on all extra flags that people tend to give. One should really ponder upon this when looking at desperate attempts to speed up code. However, there is exactly one flag-combo that is very-very useful besides “-O3”: it’s “-fprofile-generate” and “-fprofile-use”. To understand why these are useful, we must first understand that one of the challenges faced by an optimising compiler is to try to guess how many times a loop will be executed,and how many times a branch will be taken. Given a good set of guesses, the loop can be unwound (or not) and the branches can be taken by default (or not). If compiled with “-fprofile-generate”, the program generates such information on-the-fly, which later can be used by “-fprofile-use”. The speedup achieved with such an approach in the realms of DPLL-based SAT solvers is relatively high, in the order of ~5-10%. I believe many SAT solvers’ binaries don’t use this trick, even though it’s cheap to use. I personally compile with “-fprofile-generate”, then run an example problem like UTI-20-10p1 on it, and then recompile with “-fprofile-use”. An important aspect, by the way, is to execute your problem with a very typical scenario: strange or non-normal scenarios will produce spurious branching and loop data which will actually slow down the program most of the time instead of speeding it up.

The second optimisation I would like to talk about concerns cache usage. Modern processors are extremely fast, to the point that many times what really limits the processor is no longer the ability to execute the code on the data, but to actually get the data that the code needs. This problem is usually alleviated through spatial and temporal data locality that is, that data most likely needed next is usually physically close to the one that we just accessed. However, in DPLL-based SAT solvers, the main function, propagate() goes through a list of pointers, accessing the data where each pointer points to. In other words, the CPU will fetch the next pointer in the background, but will not fetch where that pointer points to — simply because it doesn’t know it’s a pointer. So, how could we alleviate this problem? Well, by telling the CPU that the next piece of information that will be needed is where the next pointer points to. Like this:

```for (i = 0; i < pointers.size();  i++) {
if (i+1 < pointers.size())
__builtin_prefetch(pointers[i+1]);
treat_data(*pointers[i]);
}```

where `pointers` is a vector of pointers that must each be treated. The function `__builtin_prefetch()` calls an assembly instruction that non-blockingly fetches the data pointed to into the Level 1 cache of the processor. This/these instruction(s) are not implemented in all CPUs, but many x86 and all AMD64 architectures have them. Essentially, while we are dealing with data where `pointers[i]` points at, we fetch the data where `pointers[i+1]` points at, in parallel. Interleaving the fetching of next data with the treatment of current data helps us do more things in less time, speeding up the program. Interestingly, since prefetching the data does not alter the flow of the program in any way, omitting the prefetch call does not alter the results of the program at all: it just makes it run slower. In CryptoMiniSat I use this optimisation, and it seems to speed up the solving by a couple of percentages, maybe 5-8%. The amount of time you will save on this depends on the locality and size of the information your pointers are pointing to. If they are very small, e.g. 2-3MB in total and are close to each other, then the CPU will fetch these after the first couple of pointers have been dereferenced, so there is no need to do explicit prefetching. But if the data you are dealing with is much larger, which is often the case with SAT solvers, then such explicit pre-fetching can save quite some time.

Optimisations

There are plenty of optimisations that I have tried with CryptoMiniSat. I will try to list a few that I finally didn’t consider using, but could be implemented later to get around 4-8% speedup each.

It seems a good idea to use a separate CNF simplifier such as SatELite before executing the SAT solver. The reasoning is, that even though one can import (or copy-paste) the code from SatELite, the memory will stay fragmented: variable elimination might remove half the variables, and so the arrays that store variable values will have unused bytes in them, using a larger footprint than necessary, thus causing cache misses. Though this is true, SatELite is not exactly perfect. For instance, it can be seen in certain instances of the SAT Competition of 2009 that GLUCOSE sometimes had very bad performance in comparison with other solvers exactly because SatELite took, e.g. 130 seconds, but the solving only took 2 seconds. So SatELite slowed down the solving of this instance. Also, SatELite has some limitations, e.g. it cannot handle more than 4’800’000 clauses otherwise it might take just too much time. These are huge drawbacks, and in fact CryptoMiniSat (and, e.g. PrecoSat) overcomes all of these. The memory fragmentation issue remains, though. It could be treated, but that would require about 2-3 weeks of time to get corrected, and I simply don’t have that much time. P.S.: Easy solution: rename the variables everywhere. Hard (but less error-prone) solution: use different datastructure for variable values.

Another possible optimisation that never made it to CryptoMiniSat (though I have a branch in git that achieves it more-or-less) is the close packing of clauses in memory. It can greatly increase the cache-hit ratio, though only for relatively small (<20-40 MByte) problems. The reason why I never got this done is that it is a source of a lot of potential problems. Firstly, there is no point in packing clause sizes that are rare. The most common clause size is 2 (see SAT Competition instances), so it’s only logical to pack these clauses. However, this presents a problem: how should the clause be freed after usage? If it was a packed clause, it needs to be freed specially (from the packed heap). But sometimes, non-packed clauses shrink to size 2, so they should become packed. Must these clauses be freed, and then re-allocated from the packed heap? Doesn’t that add too much complexity? Also, according to my experience, packed heaps only help when the problem is small. For large problems, the cache miss rate will be very high anyway, and so packing won’t make much of a difference.

There are some other optimisations that I missed out on, I think. I will one day look through my very messy git tree (with 41 branches and approx. 1600 commits) to see which ones did I once implement and never got working :)

PS: After the start of the SAT Race 2010, I will put all of the git tree up into gitorious. Then, everyone can use it and try to merge these trees into the main tree, and see what happens.