Tag Archives: BCP

Visiting Linz

I recently had the pleasure of visiting Linz, Armin Biere’s workplace, where I gave a quick talk on SAT solver architectures. To me, it was really interesting to think through that presentation — not because it was entirely new or exciting, but because it recapped many of the issues that have been bothering me lately. Namely, that it’s difficult to make a really versatile SAT solver, because the low-level choices that must be made (e.g. the watchlist scheme) determine so much about higher-level architectural decisions such as clause sharing, or even something as simple as hyper-binary resolution. As for the latter, thanks to Armin Biere’s thoughts I have finally realised why my hyper-binary resolution was so slow: I recently decided not to fully propagate binary clauses before propagating normal (i.e. larger) clauses, which made hyper-binary resolution much slower, as I had to re-calculate the binary implication graph. Not fully propagating binary clauses before normal clauses also seemed to influence my much higher-level choice of using cached implications, as they (intuitively, and also in practice) help much more when binary clauses are not fully propagated first. This influence is interesting to think through, as something this trivial shouldn’t — at least in principle — affect such a high-level decision.

Also thanks to Armin Biere, I have managed to gain a better understanding of lingeling and its superior watchlist scheme. Some of the architectural elements of lingeling’s watchlist scheme are really interesting, and when they get published I will definitely port some of them to CryptoMiniSat. It seems to use much less space and stores information in a more cache-friendly manner, aiding the processor in its job. A related, interesting pointer I picked up is this paper that originally introduced blocking literals, which also discusses a range of other ideas that can help. All in all, it was great to visit Linz and the group of Armin Biere, as I have managed to learn a lot from him and his colleagues.

Delayed updates

In this post I will try to explain how locality of reference can be exploited by grouping operations according to the memory locations they access. In the SAT solver I write, there is a loop that does something called propagation. The goal of propagation is to set a number of variables to a given value through a pretty cache-intensive process. While doing this, as the variables are set, some statistics are calculated. For example, I used to have something like the following called for every variable that gets set:

void enqueueNewFact(Lit lit, PropagatedBy someData) {
    trail.push_back(lit);                      // remember the full literal (var + sign)
    value[lit.var()] = lit.sign();             // needed immediately by propagate()
    agility *= conf.agilityG;                  // statistics updates from here on
    if (varPolarity[lit.var()] != lit.sign()) agility += 0.0001;
    varPolarity[lit.var()] = lit.sign();
    varSet[lit.var()]++;
    level[lit.var()] = getLevel();
}

The problem with this function is that it accesses many different pieces of data: trail[], value[], level[], varPolarity[] and varSet[]. All of these must be in cache to be updated, and if they aren’t, they must be pulled in from lower levels of cache or from main memory. Since enqueueNewFact() is called often, these arrays are mostly in cache all the time, taking space away from what really needs it: the data used during the propagate() routine.

What I have recently realised is that I don’t use the values of level[], varPolarity[], varSet[] or agility at all during the propagate() routine. Instead, they are used when the recursive calling of propagate() has finished and a certain technical point (called a conflict) is reached. If that is the case, why are we updating them all the time, immediately when enqueueNewFact() is called? As you can see in the code, for various reasons we must save a trail (called, conveniently, trail[]) of what we have done. This trail contains all the information needed by the statistics-update code: the var() and the sign(), both contained in Lit, are all that’s necessary.

So, I have made a new function, delayedEnqueueUpdate(), that calculates all the statistics only when they are needed: when a conflict is reached. It keeps a simple piece of state, lastDelayedEnqueueUpdate, that records the trail position where it was last called. The default is trail position 0, and whenever we reach a conflict, we update it once the statistics have been calculated:

void delayedEnqueueUpdate() {
    int pseudoLevel = lastDelayedEnqueueUpdateLevel;
    for (size_t i = lastDelayedEnqueueUpdate; i < trail.size(); i++) {
        Lit lit = trail[i];
        agility *= conf.agilityG;
        if (varPolarity[lit.var()] != lit.sign()) agility += 0.0001;
        varPolarity[lit.var()] = lit.sign();
        varSet[lit.var()]++;
        // reconstruct the decision level this trail entry belongs to
        if (pseudoLevel < (int)trail_lim.size() && trail_lim[pseudoLevel] == (int)i)
            pseudoLevel++;
        level[lit.var()] = pseudoLevel;
    }
    lastDelayedEnqueueUpdate = trail.size();
    lastDelayedEnqueueUpdateLevel = decisionLevel();
}

This code, apart from the pseudoLevel and lastDelayedEnqueueUpdateLevel handling (both of which are SAT solver technicalities), is rather easy to understand. Basically, we take the position where we stopped last time, lastDelayedEnqueueUpdate, and work our way from there to the end of the trail, trail.size(). Once we have worked our way through, we set lastDelayedEnqueueUpdate to the current end point, trail.size().

All this seems useless if we don't take cache usage into consideration: we are not saving any CPU instructions. If anything, we add instructions to execute, as we now need an extra for loop (propagate() is essentially a recursive for loop). However, from a cache usage point of view, we are saving a lot. While doing propagation, the propagate() routine can have all the cache it needs. Then, once that cache-intensive process has finished, we pull in a set of arrays (varPolarity[], varSet[], level[]), update them all in one go, and then continue operation.

We have managed to group operations according to which pieces of memory they use and update. This makes the program more cache-friendly, which almost always leads to a measurable speed increase.

On the New Watchlist Scheme

All right, so I am not so cool because I work on weekends, but I got this new watchlist scheme implemented. First off, let me say it was relatively easy to implement, which is a testament to the extensibility of CryptoMiniSat: it only took me ~4 hours to get it done. Secondly, I realised that I had forgotten about XOR clauses when I thought through the new scheme. I used to store the current value of the XOR clause in the XOR clause itself (using the signs of the literals, being the hacker I am), and update it at every call to propXorClause(). Since we need to keep XOR clauses static just like normal clauses, this is no longer possible, so a separate data structure needs to be built for it. However, it won’t need too many updates, so all is fine, but it does make things a tad more complicated.

When first implemented, the new scheme performed worse than the normal one — much worse. We lost ~20% in terms of time. I never really give up on my ideas, however, so I started thinking: what are the advantages of this new scheme, and how could we make them work for us? First of all, since the watchlists are now more expensive to go through, we might as well spend more time at the clause, looking for a watch that is not likely to trigger again. In other words, when we are at the clause looking for a new watch, don’t just look for a literal that is either Undefined or True: strongly prefer to find one that is True. The idea is that if we set the watch to a literal that is True, during backtrack that literal might still be True (as we might not have undone that assignment), and so we won’t have to bother propagating it next time. This means more time going through the clause, but once the clause is already in the cache, going through it takes almost no time at all. This gained about ~5% in time: good, but not enough.
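
Roughly, the replacement-watch search then looks something like the following minimal sketch (the names Lit, litValue and findNewWatchPos are illustrative stand-ins, not CryptoMiniSat’s actual code):

#include <cstdint>
#include <vector>

// A literal is var*2 + sign (sign 1 = negated); assigns[var] is
// 0 = False, 1 = True, 2 = Unassigned.
typedef uint32_t Lit;
enum Val { V_FALSE, V_TRUE, V_UNDEF };

static Val litValue(Lit l, const std::vector<uint8_t>& assigns) {
    uint8_t v = assigns[l >> 1];
    if (v == 2) return V_UNDEF;
    return (v ^ (l & 1)) ? V_TRUE : V_FALSE;
}

// When hunting for a replacement watch, remember the first unassigned
// literal, but keep scanning in the hope of finding one that is True:
// such a watch may survive backtracking and spare us the next visit.
int findNewWatchPos(const std::vector<Lit>& cl, int skip1, int skip2,
                    const std::vector<uint8_t>& assigns) {
    int fallback = -1;
    for (int i = 0; i < (int)cl.size(); i++) {
        if (i == skip1 || i == skip2) continue;
        Val val = litValue(cl[i], assigns);
        if (val == V_TRUE) return i;                      // strongly preferred
        if (val == V_UNDEF && fallback == -1) fallback = i;
    }
    return fallback;                                      // -1: propagation or conflict
}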

For the next idea, we have to notice that we now go through each clause from the beginning, and the literals there don’t change, in contrast to the scheme where we constantly move literals to the beginning of the clause. Almost every modern solver has something called polarity caching: we memorise the last polarity (True or False) of every variable, and when branching on it next time, we take the cached polarity. In CryptoMiniSat there is a slight salt added, so with a low probability we take the opposite branch, but still: the cached polarity is a pretty good indicator of whether a literal will be True or False. If that is the case, and we preferably search for literals that are True at the beginning of the clause… why not move the literals that have a True cached polarity to the beginning of the clause? So, I added a trivial (~20 line) algorithm that is executed very infrequently (~once per 100’000 conflicts) to do this. An interesting property of the new watchlist scheme is that this can be done without detaching and reattaching the clauses. Once implemented, the time to solve decreased by ~13%. Overall, then, the speed of the new watchlist scheme is close to the original.
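
Such a reordering pass could look roughly like this sketch (assumed names and types, with std::stable_partition standing in for whatever the actual ~20-line algorithm does):

#include <algorithm>
#include <cstdint>
#include <vector>

// A literal is var*2 + sign (sign 1 = negated); polarity[var] is the cached
// polarity: true means the variable was last assigned True.
typedef uint32_t Lit;

static bool likelyTrue(Lit l, const std::vector<bool>& polarity) {
    bool negated = l & 1;
    return polarity[l >> 1] != negated;   // literal agrees with the cached polarity
}

// Executed rarely (e.g. once per ~100'000 conflicts): move the literals that
// are likely to be True to the front of the clause, so the preferred watch
// is found early during the scan.
void reorderClause(std::vector<Lit>& clause, const std::vector<bool>& polarity) {
    std::stable_partition(clause.begin(), clause.end(),
        [&](Lit l) { return likelyTrue(l, polarity); });
}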

The advantages of this new scheme are manifold. Since we are only content when finding a True watched literal in the clause, we decrease the overall number of memory requests per thread, which should let the threads as a whole make fewer memory requests, putting less strain on the memory architecture. Not to mention that we hardly touch the clauses at all, so write-back to memory is dramatically decreased, probably by 80% or more. And finally, we can share the clauses, leading to fewer overall memory fetches.

The only drawback of this new scheme seems to be that we cannot optimise for every thread: cached polarities are stored on a per-thread basis. However, one would expect that most of the time the cached polarities of the different threads are close. Therefore, this should work relatively well. Also, we could do voting for polarities among threads, and then sort literals in the clauses according to the votes cast — we would then optimise for the whole set of threads instead of only the master thread.
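
As a sketch of what such voting could look like (entirely hypothetical names, not an existing CryptoMiniSat feature): each thread contributes one vote per variable, and the resulting majority polarity could then be used to order the literals in the shared clauses.

#include <cstdint>
#include <vector>

// perThreadPolarity[t][v] is thread t's cached polarity for variable v.
std::vector<bool> votePolarities(const std::vector<std::vector<bool> >& perThreadPolarity,
                                 size_t numVars) {
    std::vector<int> votes(numVars, 0);
    for (size_t t = 0; t < perThreadPolarity.size(); t++)
        for (size_t v = 0; v < numVars; v++)
            votes[v] += perThreadPolarity[t][v] ? 1 : -1;

    std::vector<bool> majority(numVars);
    for (size_t v = 0; v < numVars; v++)
        majority[v] = votes[v] >= 0;          // ties default to True
    return majority;
}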

While doing the above, I have also played around with OProfile, a Linux tool that can report (among other things) an approximation of the number of L2 cache misses of a program. This helped me put the right prefetch builtins at the right places, and the number of cache misses is now half of what it is without prefetch instructions (for both watchlist schemes). CryptoMiniSat 2.9.0 had some prefetch instructions added at the very last moment, but they weren’t placed optimally. The new way they are organised is much better, leaving the CPU more time to prefetch the cache lines (which can take many cycles).
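
The general pattern is the usual one: issue the prefetch well before the data is needed, e.g. prefetch the clause of the next watchlist entry while still handling the current one. A minimal sketch (the Watched struct and its fields are made up; __builtin_prefetch is the GCC/Clang builtin):

#include <cstddef>
#include <vector>

struct Watched {
    void* clausePtr;        // pointer to the watched clause
    // ... blocking literal, clause size hints, etc. ...
};

void walkWatchlist(std::vector<Watched>& ws) {
    for (size_t i = 0; i < ws.size(); i++) {
        if (i + 1 < ws.size())
            __builtin_prefetch(ws[i + 1].clausePtr);   // non-blocking read prefetch
        // ... inspect ws[i] and its clause here; by the time we get to the
        //     next entry, its clause has hopefully arrived in cache ...
    }
}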

PS: The current architecture doesn’t allow CryptoMiniSat to share clauses, but even so, multi-threading is already ~5% faster thanks to less memory write-back and less memory contention, due to the optimised watched-literal picking.

Changing the Watchlist Scheme

Okay, what if we changed the watchlist scheme… a bit? The way the watchlist scheme operates in SAT solvers is the following. We have a set of literals lit1, lit2, lit3, lit4 that defines the clause we want to watch for propagation and conflict. We take two literals from this set and put them in a datastructure. It’s not important which ones we take, but we have to know which ones we took. So, we take the 1st and the 2nd (because we are lazy), and we put them in a set of lists called the watchlists. We now have a pointer to the clause in watchlist[~lit1] and in watchlist[~lit2], where ~ is the binary negation. If we now set lit1 = false, we have to look at the clauses in watchlist[~lit1] and see whether any of them forces a propagation or a conflict. If a clause does neither, we remove it from watchlist[~lit1], find another literal in it that is not false, e.g. lit3, and put the clause into watchlist[~lit3]. It is important that lit3 must not be lit1 or lit2. Obviously, we know what lit1 is (we are visiting its watchlist, after all), but how could we know what lit2 is? Well, we always update the clause such that lit1 and lit2 are the first two literals. We do this for every clause in every watchlist we visit, until a conflict is found or there is nothing left to propagate.
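
In code, the per-clause step of this classic scheme looks roughly like the following sketch (heavily simplified stand-in types and names, not any solver’s actual implementation):

#include <cstdint>
#include <utility>
#include <vector>

// A literal is var*2 + sign (sign 1 = negated); values[var] is
// 0 = False, 1 = True, 2 = Unset.
typedef uint32_t Lit;
enum Val { V_FALSE, V_TRUE, V_UNDEF };

static Val litValue(Lit l, const std::vector<uint8_t>& values) {
    uint8_t v = values[l >> 1];
    if (v == 2) return V_UNDEF;
    return (v ^ (l & 1)) ? V_TRUE : V_FALSE;
}

enum Outcome { MOVED_WATCH, KEPT_WATCH, PROPAGATE_OTHER, CONFLICT };

// Called for each clause in watchlist[~lit1] once lit1 has become False.
// The invariant is that the two watched literals sit at positions 0 and 1.
Outcome handleClause(std::vector<Lit>& cl, Lit falseLit,
                     const std::vector<uint8_t>& values) {
    // Make sure the false watch is at position 1, so position 0 is the other watch.
    if (cl[0] == falseLit) std::swap(cl[0], cl[1]);
    if (litValue(cl[0], values) == V_TRUE) return KEPT_WATCH;   // clause satisfied

    // Try to find a replacement watch among the other literals.
    for (size_t i = 2; i < cl.size(); i++) {
        if (litValue(cl[i], values) != V_FALSE) {
            std::swap(cl[1], cl[i]);
            // ...here the clause would be moved from watchlist[~falseLit]
            //    to watchlist[~cl[1]]...
            return MOVED_WATCH;
        }
    }
    // No replacement found: the clause forces the other watch, or conflicts.
    if (litValue(cl[0], values) == V_FALSE) return CONFLICT;
    return PROPAGATE_OTHER;    // ...here cl[0] would be enqueued as a new fact...
}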

So, why is this inefficient? Well, what needs to be noticed is that we update the clause non-stop in order to keep track of lit1 and lit2. We can implement this by changing the order of literals (that’s how it’s done in most cases), or by updating some data in the clause. Either way, this needs an update of the clause, and consequently of the whole cacheline on which the clause (and probably other clauses, as these are memory-packed) resides. Let’s say the average clause length is 10, but the maximum clause length is 2^16. Since a literal is stored on 32 bits, an average clause, with all the extra info it needs to keep, is around 10*32 + 32 + 32 = 384 bits long. The cache line size of Core i7 processors is 64 bytes (= 512 bits), though I believe newer architectures are bumping this to 128 bytes. We are therefore asking the CPU to grab 64 bytes from memory, store them in the CPU cache, and then put the changed value back into main memory — as the cache is far too small to hold all the clauses we will visit. If you take a CPU with 6 cores and try to run this algorithm, you are probably looking at a memory bandwidth nightmare. The clauses cannot be shared, because updating the lit1 and lit2 positions is not an option: these would be different for each thread. Furthermore, the CPU is not only taking data from main memory, but also putting data back into main memory at an astonishing rate.

Let’s take a Core i7 processor off the shelf and see what its architecture looks like. We have 1MByte of L2 cache for each core, and 8MByte of shared L3 cache for all the CPU cores. This is like a finger pointing at us: keep a big piece of memory that is constant and can be shared, and a much smaller piece of thread-local memory. Also: please lighten the load on the memory-access subsystem, as it isn’t going to keep up with all 6 cores churning through the clauses. Well, an interesting solution is to split each clause into two pieces: one that stores which literals are lit1 and lit2, and one that stores the literals themselves and the extra information. The first piece, let’s call it threadContext, stores the positions of lit1 and lit2, each on 16 bits (enough for the maximal clause size), so 4 bytes in total. The clause itself then never needs to be updated at all. An average problem instance is typically beaten down to a size of <300’000 clauses by CryptoMiniSat in <500 seconds. Optimally, we would like to access these clauses from main memory or the L3 cache (this latter is a dream, really), and keep their threadContext in the L2 cache of the core. The threadContext is 300’000*4 bytes = 1.2MByte for a typical problem. There are also some other things to keep in L2, such as the current variable assignment and the current trail of assignments, but each of these should be much smaller. The total would probably be ~2MByte. Oh well, we didn’t fit it inside the 1MByte of L2 cache, but still: the net effect of this change should be positive. Furthermore, even if we only use one core, more memory bandwidth should be available, as we hardly move any information back into main memory any more.
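
A sketch of what such a split could look like (all names assumed; this is the idea, not an existing implementation):

#include <cstdint>
#include <vector>

typedef uint32_t Lit;

struct Clause {                      // shared: read-only during propagation
    std::vector<Lit> lits;           // the literals, plus activity, flags, etc.
};

struct ThreadContext {               // per-thread: 4 bytes per clause
    uint16_t watchPos1;              // position of lit1 inside the clause
    uint16_t watchPos2;              // position of lit2 inside the clause
};

struct ThreadState {
    std::vector<ThreadContext> ctx;  // indexed by clause number
    // the per-thread assignment, trail, watchlists, ... would live here too
};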

So why wouldn’t this idea work? Well, for starters, we would access two pieces of memory whenever visiting a clause: instead of only accessing the clause, we now also have to access threadContext to be able to work with the clause. I am willing to argue that this is not a problem at all. The reason is really simple: since threadContext is small, and since it is accessed all the time, it will surely be in cache, probably L2, maybe L3. The clause itself, however, is almost sure to be in main memory, or failing that, in L3. All we need to do is emit a non-blocking memory-prefetch instruction for both entities when we want to treat the clause, and then work with the clause. Since threadContext is almost sure to arrive at the CPU earlier than the clause itself, and since the two fetches can proceed in parallel, the speed difference will be either negligible or nonexistent. The other argument we could come up with relates to the following design feature: swapping the literals lit1 and lit2 to always be at the head of the clause lets the CPU travel through less of the clause and carry out fewer instructions (less checking for lit1 and lit2). This could mean fewer cachelines needed and fewer instructions carried out. Since memory is the bottleneck, the number of instructions is not of any real concern. More cachelines are more of a problem, but if we access the right literal at the right point in the clause, then parts of the clause might not be needed at all, and we would only need to grab more cachelines than with the original scheme if we happen to roll around the clause: starting in the middle, reaching the end without finding a new unassigned literal, and having to go back to the beginning. This shouldn’t occur too often, and the clause might be on the same cacheline anyway (with <50% chance for a 10-long clause and a 128-byte cacheline).
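
Assuming a split like the one sketched above, the prefetching would look something like this (illustrative only; __builtin_prefetch is the GCC/Clang builtin):

#include <cstdint>
#include <vector>

struct ThreadContext { uint16_t watchPos1, watchPos2; };
struct Clause { uint32_t size; uint32_t lits[1]; };    // illustrative layout

// Before treating clause number 'num', issue non-blocking prefetches for
// both its per-thread context and its shared body; the small context is
// very likely in L2 already, while the clause body is the fetch whose
// latency we actually want to hide.
inline void prefetchForClause(uint32_t num,
                              const std::vector<ThreadContext>& ctx,
                              const std::vector<const Clause*>& clausePtrs) {
    __builtin_prefetch(&ctx[num]);         // per-thread watch positions
    __builtin_prefetch(clausePtrs[num]);   // shared, read-only clause body
    // ... do some other work, then read ctx[num] and *clausePtrs[num] ...
}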

The last, and most burning question is: will this ever be implemented? Well, if we are really to scale to 6- and 12-core CPUs, I think eventually a better way to share cache will be needed. This is just one way of doing it — probably a crazy complicated and inefficient way at that. Only a good implementation can tell if this approach is worthwhile to pursue, and the 2011 SAT Competition is far too close for me to get this right. Oh well, let’s get back to removing that “last” bug in CryptoMiniSat 2.9.0….

Edited to Add(21/01/2011): I forgot that watchlists also need to be in cache. However, most of the watchlists needed can be prefetched: when a literal is propagated, we can prefetch the corresponding watchlist immediately. The only problem is when a watchlist is needed because a clause’s watched literal is changed. This poses a headache that I haven’t yet thought through.

Caching

“Representation is the essence of programming”
Frederick P. Brooks, Jr., “The Mythical Man-Month”

Looking through the literature on SAT solvers, it is rare to find an algorithm that uses any form of time-memory trade-off. In fact, it is rare to find an algorithm that uses much memory at all, and the same shows up in practice: SAT solvers rarely use much memory. This translates to the following: if a SAT solver can effectively use the typically abundant memory, that SAT solver will have an advantage. Based on this logic, I think CryptoMiniSat is doing well.

What I have realised is that it’s great to know all the literals we can reach from any given literal. Let me explain. Let’s say we have a look at literal “v1”, i.e. “v1 = true”. We propagate this literal as usual and reach a set of implied literals, for example “v2, v3, -v4”. This knowledge is regularly computed by modern SAT solvers, but it is also quickly thrown away. I advocate keeping it, and here is why: there are a number of very interesting things we can do with it, three of which I have found useful.
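
The cache itself can be as simple as one list of reached literals per literal, along these lines (a sketch with assumed names, not CryptoMiniSat’s actual data structure):

#include <cstdint>
#include <vector>

typedef uint32_t Lit;                        // var*2 + sign (sign 1 = negated)

struct ImplicationCache {
    // implied[l] = the literals reached when literal l was propagated alone
    std::vector<std::vector<Lit> > implied;

    void resize(size_t numVars)              { implied.resize(numVars * 2); }
    void store(Lit l, const std::vector<Lit>& reached) { implied[l] = reached; }
    const std::vector<Lit>& get(Lit l) const { return implied[l]; }
};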

Number one. This is the best, and the least obvious one. The algorithm used for computing equivalent literals (i.e. “v1 = -v2” or “v1 = v4”) is a variation of Tarjan’s algorithm for finding strongly connected components (SCC). This algorithm requires a set of binary clauses such as “-v1 v2” and “v1 -v2” as input to find equivalent literals, in this example “v1 = v2”. The most obvious way to “feed” this algorithm with information is to give it all the binary clauses we have. But that misses all the binary clauses that are not directly apparent in the problem, yet could be derived from it. For example, if “-v1 -v4” is not directly in the problem, SCC cannot find the equivalence “v1 = -v4”. Naturally, by using our cache, we can be sure that “-v1 -v4” holds (since “v1” propagates “-v4”). So, using the cache, we can substantially increase the power of SCC. More equivalent literals lead to fewer variables and an overall faster solving.
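
Conceptually, feeding the SCC algorithm then just means handing it more implication edges: every entry of the cache is a “virtual” binary clause. A sketch of building such an edge list (assumed names and representation):

#include <cstdint>
#include <utility>
#include <vector>

typedef uint32_t Lit;                          // var*2 + sign
inline Lit negLit(Lit l) { return l ^ 1; }

typedef std::pair<Lit, Lit> Edge;              // edge.first implies edge.second

// cache[l] holds the literals reached when l was propagated alone. Every
// such pair is an implication edge for SCC, together with its contrapositive,
// exactly as an explicit binary clause would be.
void buildImplicationEdges(const std::vector<std::vector<Lit> >& cache,
                           std::vector<Edge>& edges) {
    for (Lit l = 0; l < (Lit)cache.size(); l++) {
        for (size_t i = 0; i < cache[l].size(); i++) {
            Lit implied = cache[l][i];
            edges.push_back(Edge(l, implied));                    // l -> implied
            edges.push_back(Edge(negLit(implied), negLit(l)));    // contrapositive
        }
    }
}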

Number two. Clause vivification is a method to make the original clauses of the problem shorter. It’s a relatively simple algorithm that enqueues the negations of the literals in a clause one by one and checks whether their propagation leads to a conflict; if so, the clause can be shortened. With the cache, we can do something similar, less powerful in terms of reasoning, but far more effective in terms of speed. The trick is to simply look up every single literal of the clause in the cache: if a literal propagates another literal that is already in the clause, the propagating literal can be removed. For example, if the clause is “a b -c”, and in the cache for “b” there is “f,g,h,j,-c”, then we know that conceptually there exists a clause “-b -c”, which means we can now remove “b” from the original clause, using the self-subsuming resolution rule. This form of vivification, although technically weaker than normal vivification, is typically 50-100x faster and almost as powerful. This kind of speed advantage means it can essentially be carried out without (time) penalty.
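
A sketch of this cache-based strengthening (assumed names; the cache is indexed by literal, as in the sketch above):

#include <cstdint>
#include <set>
#include <vector>

typedef uint32_t Lit;                          // var*2 + sign

// If some literal of the clause implies (via the cache) another literal that
// is still in the clause, drop the implying literal: this is self-subsuming
// resolution with the conceptual binary clause, as in the "a b -c" example.
void strengthenWithCache(std::vector<Lit>& clause,
                         const std::vector<std::vector<Lit> >& cache) {
    std::set<Lit> inClause(clause.begin(), clause.end());
    std::vector<Lit> kept;
    for (size_t i = 0; i < clause.size(); i++) {
        Lit l = clause[i];
        const std::vector<Lit>& implied = cache[l];
        bool remove = false;
        for (size_t j = 0; j < implied.size(); j++) {
            if (implied[j] != l && inClause.count(implied[j])) {
                remove = true;
                break;
            }
        }
        if (remove) inClause.erase(l);   // strengthen, then keep checking the rest
        else kept.push_back(l);
    }
    clause.swap(kept);
}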

Number three. When generating conflict clauses, MiniSat-type conflict minimisation has now become commonplace. This uses the clauses involved in the conflict to carry out self-subsuming resolution on the conflict clause generated. Just as with clause vivification, we can use our cache to carry out self-subsuming resolution with the (conceptually binary) clauses stored in the cache. This is not much different from clause vivification, but it allows us to do simplification in the middle of a restart, instead of patiently waiting for the restart to end. Furthermore, it can uncover that certain learnt clauses can become binary, thus alleviating the problem of cleaning “useless” learnt clauses that could have become binary through, e.g. clause vivification.

I am aware that all three of the above could be carried out without a cache — in fact, CryptoMiniSat could do (and most of the time did) all of the above without it. The advantage of having the cache is speed: things that used to take immense amounts of time can now be done really fast. However, what interests me most about this cache is the new uses that will come of it. I was originally only aware of number (3); then, when I realised (1), I dived deeper and realised that (2) could be done too. I think more uses will eventually emerge.