Changing the Watchlist Scheme | Wonderings of a SAT geek

Okay, what if we changed the watchlist scheme… a bit? The way the watchlist scheme operates in SAT solvers is the following. We have a set of literals lit1, lit2, lit3, lit4 that defines the clause we want to be watched for propagation and conflict. We take two literals from this set of literals, and put them in a datastructure. It’s not important which ones we take, but we have to know which are the ones we took. So, we take the 1st and the 2nd (because we are lazy), and we put them in a set of lists called the watchlists. We now have a pointer to the clause in watchlist[~lit1] and in watchlist[~lit2], where ~ is the binary negation. If we now set lit1 = false we have to have a look at the clauses in watchlist[~lit1], see if any of the clauses forces a propagation. If not, then we remove the corresponding lit1 from the watchlist, find another literal that is unset, e.g. lit3, and put that into the watchlist[~lit3]. It is important that lit3 must not be lit1 or lit2. Obviously, we know what lit1 is (we are visiting its watchlist, after all), but how could we know what lit2 is? Well, we always update the clause such that lit1 and lit2 are the first two literals. We do this for every clause in every watchlist, and until a conflict is found, or all clauses are satisfied.

So, why is this inefficient? Well, what needs to be noticed, is that we updated the clause non-stop in order to keep in mind lit1 and lit2. We can implement this by changing the order of literals (that’s how it’s done in most cases), or by updating some data in the clause. However, this needs an update of the clause, and consequently of the whole cacheline on which the clause (and probably other clauses, as these are memory-packed) reside. Let’s say the average clause length is 10, but maximum clause length is 2^16. Since the maximal literal length is 32 bits, an average clause, with all the extra info it needs to keep, is around 10*32+32+32= 384 bits long. Cache line size for a Core i7 processors is 64 Bytes (=512bits), though the new architectures I believe are bumping this to 128 Bytes. We are therefore asking the CPU to grab 64Bytes from memory, store it in the CPU cache, and then put the changed value back into main memory — as the cache is woefully small to keep all the clauses we will visit. If you take a CPU with 6 cores and try to run this algorithm, you are probably looking at a memory bandwidth nightmare. The clauses cannot be shared, because updating the lit1 and lit2 positions is not an option, as these would be different for each thread.Furthermore, the CPU is not only taking data from main memory, but also putting back data into main memory at an astonishing rate.

Let’s take a Core i7 processor off the shelf, and see how its architecture looks like. We have 1MByte of L2 cache for each core, and we have 8MByte of shared L3 cache for all the CPU cores. This is like a finger pointing at us: have a big piece of shared memory that is constant and can be shared, while keep a much smaller piece of thread-local memory. Also: please lighten up the load on the memory access sub-architecture, as that isn’t going to keep up with all the 6 cores churning through the clauses. Well, a interesting solution is then to split each clause into two pieces: one to store which is lit1 and lit2, and one that stores the literals and the extra information. The first piece, let’s call this threadContext stores the point of lit1 and lit2, each of which is 16 bits (the maximal size of the clause), so 4 bytes in total. The clause will then not need any update at all. An average problem instance is typically beaten down to a size of <300’000 clauses by CryptoMiniSat in <500 seconds. Optimally, we would like to access these clauses from main memory or L3 cache (this latter is a dream, really), and keep their threadContext in the L2 cache of the core. So, this latter is 300’000*4=1.2MByte for a typical problem. There are also some other things to keep in L2, such as current variable assignment, and the current trail of assignments, but each of these should be much smaller. The total would probably be ~2MByte. Oh well, we didn’t fit them inside the 1MByte of L2 cache, but still. The net effect of this change should be positive. Furthermore, even if we only use one core, more memory bandwidth should be available, as we don’t move that much (or hardly any) information back into the main memory any more.

So why wouldn’t this idea work? Well, for starters, we would access two pieces of memory whenever visiting a clause: instead of only accessing the clause, we now also have to access threadContext to be able to work with the clause. I am willing to argue that this is not a problem at all. The reason is really simple: since threadContext is small, and since it is being accessed all the time, it will surely be in cache, probably L2, maybe L3. However, the clause itself is almost surely to be in main memory, but if not, then in L3. All we need to do, is to emit a non-blocking memory-prefetch assembly instruction for both entities when we want to treat the clause, and then work with the clause. Since threadContext is almost sure to arrive earlier to the CPU than the clause itself, and since these can be done in parallel, the speed difference will be either negligible or nonexistent. The other argument we could come up with is related to the following design feature: swapping the literals lit1 and lit2 to be always at the head of the clause allows the CPU to travel through less of the clause and carry out less instructions (less checking for lit1&lit2). This could mean less cachelines needed, and less instructions carried out. Since memory is the bottleneck, number of instructions is not of any real concern. More cachelines are more of a problem, but if we access the right literal at the right point in the clause, then parts of the clause might not be needed, and we would only need to grab more cachelines than with the original scheme if we happen to roll around the clause: starting at the middle, reaching the end and not finding a new unassigned literal, having to go to the beginning. But this shouldn’t occur too often, and the clause might be on the same cacheline anyway (with <50% chance for a 10-long clause and a 128Byte cachline).

The last, and most burning question is: will this ever be implemented? Well, if we are really to scale to 6- and 12-core CPUs, I think eventually a better way to share cache will be needed. This is just one way of doing it — probably a crazy complicated and inefficient way at that. Only a good implementation can tell if this approach is worthwhile to pursue, and the 2011 SAT Competition is far too close for me to get this right. Oh well, let’s get back to removing that “last” bug in CryptoMiniSat 2.9.0….

Edited to Add(21/01/2011): I forgot that watchlists also need to be in cache. However, most of the watchlists needed can be prefetched: when a literal is propagated, we can prefetch the corresponding watchlist immediately. The only problem is when a watchlist is needed because a clause’s watched literal is changed. This poses a headache that I haven’t yet thought through.