Tag Archives: memory layout

On the New Watchlist Scheme

All right, so I am not so cool because I work on weekends, but I got this new watchlist scheme implemented. First off, let me say it was relatively easy to implement, which is a testament to the extensibility of CryptoMiniSat: it only took me ~4 hours to get it done. Secondly, I realised that I have forgotten XOR clauses when I thought through the new scheme. I used to store the current value of the XOR clause at the XOR clause itself (using the sings of the literals, being the hacker I am), and update it at every call to propXorClause(). Since we need to keep XOR clauses static just like normal clauses, this is no longer possible, so a separate data-structure needs to be built up for this. However, it won’t need too many updates, so all is fine, but it does make things a tad more complicated.

When first implemented, the new scheme performed worse than the normal one — much worse. We lost ~20% in terms of time. I never really give up on my ideas, however, so I started thinking which are the advantages of this new scheme, and how could we make them work for us? First of all, since the watchlists are now more expensive to go through, we might as well spend more time at the clause, looking for a watch that is not likely going to trigger again. In other words, when we are at the clause, and looking for a new watch, don’t just look for a literal that is either Undefined or True: strongly prefer to find one that is True. The idea here is that if we set the watch to be a literal that is True, during backtrack we might still have that literal be True (as we might not have undone that assignment), and so we won’t have to bother propagating it next time. This leads to more time going though the clause, but once the clause is already in the cache, going through it takes almost no time at all. This gained about ~5% of time: good, but not enough.

For the next idea, we have to notice that we are going through each clause from the beginning, and the literals there don’t change, in contrast to the scheme where we constantly move literals to the beginning of the clause. In almost every modern solver, we have something called polarity caching: we memorise what was the last polarity (True or False) of every variable, and then when making a branch next time, we take the cached polarity. In CryptoMiniSat there is a slight salt added, so with a low possibility we take the opposite branch, but still: the cached polarity is a pretty good indicator whether a literal will be True or False. If this is indeed the case, and we are preferably searching for literals that are True at the beginning of the clause… why not move the literals that have a True cached polarity to the beginning of the clause? So, I added a trivial (~20 line) algorithm that is executed very infrequently (~once per 100’000 confl.) to do this. An interesting property of the new watchlist scheme is that this can be done without detaching and reattaching the clauses. Once implemented, the time to solve decreased by ~13%. Overall, then, the speed of the new watchlist scheme is close to the original.

The advantages of this new sheme are multifold. Since we are only content when finding a True watched literal in the clause, we decrease the overall number of memory requests per thread, which should help the totality of threads make less memory requests, leading to less strain on the memory architecture. Not to mention that we really don’t touch the clauses, so write-back to memory is dramatically decreased, by probably 80% or more. And finally, we can share the clauses, leading to less overall memory fetches.

The only drawback of this new scheme seems to be that we cannot optimise for every thread: cached polarities are stored on a per-thread basis. However, one would expect that most of the time the cached polarities of the different threads are close. Therefore, this should work relatively well. Also, we could do voting for polarities among threads, and then sort literals in the clauses according to the votes cast — we would then optimise for the whole set of threads instead of only the master thread.

While doing the above, I have also played around with OProfile that is a Linux tool to tell (among other things) an approximation of the number of L2 cache misses by a program. This helped me put the right prefetch builtins at the right places, and the number of cache misses is now half of that without prefetch instructions (for both watchlist schemes). CryptoMiniSat 2.9.0 had some prefetch instructions added at the very last moment, but they weren’t placed optimally enough. The new way they are organised is much better, leaving more time for the CPU to prefetch the lines (which can take lots of cycles).

PS: Current architecture doesn’t allow CryptoMiniSat to share clauses, but even this way multi-threading is already faster by ~5% due to less memory write-back and less memory contention due to optimised watched literal picking.

Changing the Watchlist Scheme

Okay, what if we changed the watchlist scheme… a bit? The way the watchlist scheme operates in SAT solvers is the following. We have a set of literals lit1, lit2, lit3, lit4 that defines the clause we want to be watched for propagation and conflict. We take two literals from this set of literals, and put them in a datastructure. It’s not important which ones we take, but we have to know which are the ones we took. So, we take the 1st and the 2nd (because we are lazy), and we put them in a set of lists called the watchlists. We now have a pointer to the clause in watchlist[~lit1] and in watchlist[~lit2], where ~ is the binary negation. If we now set lit1 = false we have to have a look at the clauses in watchlist[~lit1], see if any of the clauses forces a propagation. If not, then we remove the corresponding lit1 from the watchlist, find another literal that is unset, e.g. lit3, and put that into the watchlist[~lit3]. It is important that lit3 must not be lit1 or lit2. Obviously, we know what lit1 is (we are visiting its watchlist, after all), but how could we know what lit2 is? Well, we always update the clause such that lit1 and lit2 are the first two literals. We do this for every clause in every watchlist, and until a conflict is found, or all clauses are satisfied.

So, why is this inefficient? Well, what needs to be noticed, is that we updated the clause non-stop in order to keep in mind lit1 and lit2. We can implement this by changing the order of literals (that’s how it’s done in most cases), or by updating some data in the clause. However, this needs an update of the clause, and consequently of the whole cacheline on which the clause (and probably other clauses, as these are memory-packed) reside. Let’s say the average clause length is 10, but maximum clause length is 2^16. Since the maximal literal length is 32 bits, an average clause, with all the extra info it needs to keep, is around 10*32+32+32= 384 bits long. Cache line size for a Core i7 processors is 64 Bytes (=512bits), though the new architectures I believe are bumping this to 128 Bytes. We are therefore asking the CPU to grab 64Bytes from memory, store it in the CPU cache, and then put the changed value back into main memory — as the cache is woefully small to keep all the clauses we will visit. If you take a CPU with 6 cores and try to run this algorithm, you are probably looking at a memory bandwidth nightmare. The clauses cannot be shared, because updating the lit1 and lit2 positions is not an option, as these would be different for each thread.Furthermore, the CPU is not only taking data from main memory, but also putting back data into main memory at an astonishing rate.

Let’s take a Core i7 processor off the shelf, and see how its architecture looks like. We have 1MByte of L2 cache for each core, and we have 8MByte of shared L3 cache for all the CPU cores. This is like a finger pointing at us: have a big piece of shared memory that is constant and can be shared, while keep a much smaller piece of thread-local memory. Also: please lighten up the load on the memory access sub-architecture, as that isn’t going to keep up with all the 6 cores churning through the clauses. Well, a interesting solution is then to split each clause into two pieces: one to store which is lit1 and lit2, and one that stores the literals and the extra information. The first piece, let’s call this threadContext stores the point of lit1 and lit2, each of which is 16 bits (the maximal size of the clause), so 4 bytes in total. The clause will then not need any update at all. An average problem instance is typically beaten down to a size of <300’000 clauses by CryptoMiniSat in <500 seconds. Optimally, we would like to access these clauses from main memory or L3 cache (this latter is a dream, really), and keep their threadContext in the L2 cache of the core. So, this latter is 300’000*4=1.2MByte for a typical problem. There are also some other things to keep in L2, such as current variable assignment, and the current trail of assignments, but each of these should be much smaller. The total would probably be ~2MByte. Oh well, we didn’t fit them inside the 1MByte of L2 cache, but still. The net effect of this change should be positive. Furthermore, even if we only use one core, more memory bandwidth should be available, as we don’t move that much (or hardly any) information back into the main memory any more.

So why wouldn’t this idea work? Well, for starters, we would access two pieces of memory whenever visiting a clause: instead of only accessing the clause, we now also have to access threadContext to be able to work with the clause. I am willing to argue that this is not a problem at all. The reason is really simple: since threadContext is small, and since it is being accessed all the time, it will surely be in cache, probably L2, maybe L3. However, the clause itself is almost surely to be in main memory, but if not, then in L3. All we need to do, is to emit a non-blocking memory-prefetch assembly instruction for both entities when we want to treat the clause, and then work with the clause. Since threadContext is almost sure to arrive earlier to the CPU than the clause itself, and since these can be done in parallel, the speed difference will be either negligible or nonexistent. The other argument we could come up with is related to the following design feature: swapping the literals lit1 and lit2 to be always at the head of the clause allows the CPU to travel through less of the clause and carry out less instructions (less checking for lit1&lit2). This could mean less cachelines needed, and less instructions carried out. Since memory is the bottleneck, number of instructions is not of any real concern. More cachelines are more of a problem, but if we access the right literal at the right point in the clause, then parts of the clause might not be needed, and we would only need to grab more cachelines than with the original scheme if we happen to roll around the clause: starting at the middle, reaching the end and not finding a new unassigned literal, having to go to the beginning. But this shouldn’t occur too often, and the clause might be on the same cacheline anyway (with <50% chance for a 10-long clause and a 128Byte cachline).

The last, and most burning question is: will this ever be implemented? Well, if we are really to scale to 6- and 12-core CPUs, I think eventually a better way to share cache will be needed. This is just one way of doing it — probably a crazy complicated and inefficient way at that. Only a good implementation can tell if this approach is worthwhile to pursue, and the 2011 SAT Competition is far too close for me to get this right. Oh well, let’s get back to removing that “last” bug in CryptoMiniSat 2.9.0….

Edited to Add(21/01/2011): I forgot that watchlists also need to be in cache. However, most of the watchlists needed can be prefetched: when a literal is propagated, we can prefetch the corresponding watchlist immediately. The only problem is when a watchlist is needed because a clause’s watched literal is changed. This poses a headache that I haven’t yet thought through.

How to bit-stuff pointers

In my last post I have described a small problem I had regarding a C++ union that had to store a pointer and a value at the same time, plus an indicator that told us which of the two was active:

Martin Maurer has wrote this enlightening email regarding this:

The Clause * pointer is [..] located at least on a 4 byte boundary (at least when not forced by e.g. #pragma). If it is not at least on a 4 byte boundary, CPU must (at least some of them) do two accesses: one lower DWORD, one upper DWORD, and merge both together, which reduces speed.

So the lower two bits are always zero :-) Why not move the “wasPointer” to lower two bits (i think you need only one, have one spare bit). Of course before accessing the “pointer” (Clause *), you must “and” the two lower bits back to zero.

He is perfectly right: the pointer will never be zero on the least significant bit (LSB), so setting that to “1” can be used as an indicator. This is a form of bit-stuffing. To check if it’s a pointer or a literal the following function can be used:

As for storing lit, we need to right-shift it by one and set the LSB to 1. When lit is needed, we left-shift it to get back the original value. This is not a problem, since variables in SAT never use out the 32-bit space: there are rarely more than 10 million variables, so right-shifting this value will not lead to information loss.

Propagating binary clauses

Binary clauses are somewhat special. They are clauses that look like:

-v1 or v20 = true
-v1 or v21 = true

which basically mean: if v1 is TRUE, then both v20 and v21 must be TRUE. We put these clauses into a special datastructure, that looks like this:

watchlist[-v1] = v20,v21, ...

This makes us all very happy, since this datastructure is very simple: just literals (negated or non-negated variables) one after the other, in a very trivial list. This makes the whole thing very compact, and very efficient.

The problem with such a list is that when v1 is indeed set to TRUE, then when we set the variables (e.g. v20,v21), we must record what set these variables to their new values. What usually is done is to simply record the pointer to the clause that does this. However, in the above datastructure, there are no clause pointers. The datastructure for GLUCOSE contained these pointers. I have lately been experimenting with removing these pointers. The tricky part is to update the conflict generation routine that examines the clauses and decisions that lead to a conflict. This routine must now automatically recognise that these binary clauses are special, and they are only 2-long: one part of them is the variable that got set, and the other part is the variable that set it. These two informations are available: watchlist[-v1] immediately tells us that literal -v1 set the variable, and the variable that got set is always known.

Armed with this logic, one would think that moving towards this all-new no-clause-pointers heaven is the way to go. However, apparently, this might not be so. This is quite astonishing, given that doing things without pointers should mean less cache-misses (since pointers are less likely to be resolved during conflict analysis). I consistently get worse speeds, though. I am guessing that the problem lies with the datastructure I use to store the information stating what lead to the setting of the variable. I use a very simple struct:

where wasPointer is TRUE if a non-binary clause lead here (and we store the pointer), and FALSE if it was a binary clause (and we store the other literal). The union takes care of the rest: pointer is valid when wasPointer is TRUE and lit is valid when wasPointer is FALSE.

What is apparent in this datastructure is the following: we now store a 32/64 bit pointer plus we store a boolean. Right, so 42/72 bits of data? No. Data alignment in C++ means, that this datastructure will be aligned according to its largest member, so it will be aligned to 32/64-bit boundaries: this datastructure takes 128/64 bits to store on 32 and 64 bit architectures, respectively. Oops, we just doubled our demand of data writes!

There are some remedies, of course. We can simply pack:

which removes the memory burden of aligning to maximum size, and the actual size will really be 42/72 bits. But this reaises the overhead of accessing data inside this structure. Furthermore, I have been using a 32-bit mini-pointer on 64-bit architectures, which is a big hack, so it crumbles as soon as something as complicated as this comes up: so I am left with writing 72 bits instead of 32. This probably leads to the slowdowns. Oh well, I have got to work harder…

EDITED TO ADD: Next up, I will talk about some new ideas that let me solve 217 problems from the 2009 SAT Race within the original time limit. Remember that last year the best solver only solved 204.

Optimisations

There are plenty of optimisations that I have tried with CryptoMiniSat. I will try to list a few that I finally didn’t consider using, but could be implemented later to get around 4-8% speedup each.

It seems a good idea to use a separate CNF simplifier such as SatELite before executing the SAT solver. The reasoning is, that even though one can import (or copy-paste) the code from SatELite, the memory will stay fragmented: variable elimination might remove half the variables, and so the arrays that store variable values will have unused bytes in them, using a larger footprint than necessary, thus causing cache misses. Though this is true, SatELite is not exactly perfect. For instance, it can be seen in certain instances of the SAT Competition of 2009 that GLUCOSE sometimes had very bad performance in comparison with other solvers exactly because SatELite took, e.g. 130 seconds, but the solving only took 2 seconds. So SatELite slowed down the solving of this instance. Also, SatELite has some limitations, e.g. it cannot handle more than 4’800’000 clauses otherwise it might take just too much time. These are huge drawbacks, and in fact CryptoMiniSat (and, e.g. PrecoSat) overcomes all of these. The memory fragmentation issue remains, though. It could be treated, but that would require about 2-3 weeks of time to get corrected, and I simply don’t have that much time. P.S.: Easy solution: rename the variables everywhere. Hard (but less error-prone) solution: use different datastructure for variable values.

Another possible optimisation that never made it to CryptoMiniSat (though I have a branch in git that achieves it more-or-less) is the close packing of clauses in memory. It can greatly increase the cache-hit ratio, though only for relatively small (<20-40 MByte) problems. The reason why I never got this done is that it is a source of a lot of potential problems. Firstly, there is no point in packing clause sizes that are rare. The most common clause size is 2 (see SAT Competition instances), so it’s only logical to pack these clauses. However, this presents a problem: how should the clause be freed after usage? If it was a packed clause, it needs to be freed specially (from the packed heap). But sometimes, non-packed clauses shrink to size 2, so they should become packed. Must these clauses be freed, and then re-allocated from the packed heap? Doesn’t that add too much complexity? Also, according to my experience, packed heaps only help when the problem is small. For large problems, the cache miss rate will be very high anyway, and so packing won’t make much of a difference.

There are some other optimisations that I missed out on, I think. I will one day look through my very messy git tree (with 41 branches and approx. 1600 commits) to see which ones did I once implement and never got working :)

PS: After the start of the SAT Race 2010, I will put all of the git tree up into gitorious. Then, everyone can use it and try to merge these trees into the main tree, and see what happens.