Dreaming up a parallel SAT architecture

When you look at the solving speed of parallel SAT solvers, it is easy to see that scaling doesn’t really work — especially for the hard UNSAT problems. In the past months I have been struggling to come up with an architecture that will let me write a SAT solver that solves this problem. Let me share where I have got to. This will be more of a brainstorm than a descriptive blog post, but maybe it will help you avoid pitfalls, or simply give you a better picture of parallel SAT solving.

The constraints:

  • Clauses must be shared… all the time. Not “when it’s good enough” as per ManySat — that architecture doesn’t work well for UNSAT problems. The performance of a fully shared scheme should be able to overtake a ManySat-like architecture with ease. Only sharing low-glue or high-activity clauses does not work. This is easy to test: write a single-threaded SAT solver that detaches high-glue learnt clauses on backtracking. It will perform decidedly worse, even though you would remove those clauses anyway when doing clause cleaning. The reason? I don’t know… probably we need those clauses in our current local (i.e. similar variable activities) search.
  • Clauses need to have glues and activities. Glues could be dynamically updated, but the gain is not that large and the burden is a bit too much — so make them static. Activities (which are inherently dynamic) are important though (see CryptoMiniSat), and they must be lock-free as they need to be updated frequently (see the sketch after this list).
  • Memory is scarce when implication caching is used. Implication cache must be shared and updated across the parallel solvers. This not only lowers memory usage, but is an important way to share information other than clauses.
  • Parallel solvers must be lean, otherwise the burden of programming a correct solver would be far too much
  • Simplifications must be carried out regularly, which needs the coordinated stopping / interruption of all parallel solvers
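
As a concrete illustration of the lock-free activity constraint above, here is a minimal C++11 sketch; the ClauseActivity type and bump() helper are my inventions, not CryptoMiniSat code. A bump lost to a race only leaves the heuristic slightly stale, which is harmless:

```cpp
#include <atomic>

struct ClauseActivity {
    std::atomic<float> act{0.0f};

    // Racy-but-safe bump: if another thread bumped in between, retry
    // with the fresh value. Relaxed ordering suffices because activity
    // is only a heuristic; no other data depends on this write.
    void bump(float inc) {
        float cur = act.load(std::memory_order_relaxed);
        while (!act.compare_exchange_weak(cur, cur + inc,
                                          std::memory_order_relaxed)) {}
    }
};
```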

My architectural solution:

  • Make a lean class that only handles: clause attachment, detachment and propagation. I will need to add on-the-fly hyper-binary resolution and on-the-fly useless binary clause removal as well, as those are tightly coupled to propagation
  • Make a class that handles search&conflict generation. This class has the lean class as parent. It generates conflict clauses and has its own variable & clause activities. The class checks for new conflict clauses from other solvers after every conflict it generates, and attaches those clauses when that doesn’t need trail-rewriting — though this latter could be made optional. This class does not store a list of clauses. The tricky part of this class is on-the-fly subsumption. We will handle that by not shortening the clause but instead adding a new one and detaching the old one. The thread-coordinating class will handle the propagation of this update to all other threads. Checks for interruption are done after every generated conflict.
  • A thread-coordinating class that handles clause storage, clause distribution, and simplification. This class stops all threads if UNSAT or SAT is found by one thread, and interrupts them if clause-cleaning needs to be carried out. This class also has the lean class as parent, since e.g. clause vivification or variable-replacement need propagation.
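
To make the division of labour concrete, here is a rough sketch of the class split; all class and method names (PropEngine, Searcher, Coordinator) are illustrative guesses, not CryptoMiniSat’s actual API:

```cpp
#include <vector>

struct Clause;                          // opaque here
enum lbool { l_True, l_False, l_Undef };

class PropEngine {                      // the "lean" class
public:
    void attachClause(Clause* c);
    void detachClause(Clause* c);
    bool propagate();                   // incl. on-the-fly hyper-binary
                                        // resolution & useless-binary removal
};

class Searcher : public PropEngine {    // search & conflict generation
public:
    lbool search();                     // imports foreign conflict clauses
                                        // and checks the interrupt flag
                                        // after every conflict
private:
    std::vector<double> varActivity;    // thread-local heuristics
};

class Coordinator : public PropEngine { // clause storage, distribution,
public:                                 // and simplification
    void interruptAll();                // stop/interrupt all Searchers
    void distributeClause(Clause* c);   // push new learnt clauses around
};
```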

Data structures:

  • Fully shared clauses. No move-to-front strategy, instead literal-sorting is carried out once in a while (during simplification I would assume). This has already been implemented, and is only about 10% slower than non-shared clauses in single-threaded mode.
  • Interrupt handler. This is just a flag set by the coordinating solver. The subsolvers check this flag after every conflict generated, and either exit (if SAT/UNSAT is found) or just stop and mark each learnt clause locked (if cleaning the learnt clause database)
  • Implication cache. Needs read-write lock, as most of the time it will be read to do on-the-fly self-subsuming resolution with non-existent binary clauses. Can only have literals added, not removed. Removal and updating of literals is performed only during simplification. Updating (i.e. writing) is done by threads only when doing propagation at decision level 1.
  • Distribution of newly learnt clauses. A ‘lock-free’ (how I dislike this term — it’s misleading, it has locks) data structure for each thread in the thread-coordinating class that allows the storage of a fixed number of clauses, accessible & writeable without locks until the buffer is filled (at which point we lock & update pointers). A sketch follows below.
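
Here is a minimal sketch of such a buffer, under the assumption of one writer thread per buffer; ClauseBuffer, kCap, and flushToCoordinator() are hypothetical names:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>

struct Clause;

class ClauseBuffer {
    static const size_t kCap = 16384;   // pre-fixed capacity
    std::array<Clause*, kCap> buf;
    std::atomic<size_t> head{0};
    std::mutex swapLock;
public:
    // Lock-free fast path; the lock is only taken when the buffer fills.
    // Assumes a single writer thread per buffer, as described above.
    void push(Clause* c) {
        size_t i = head.fetch_add(1, std::memory_order_relaxed);
        if (i < kCap) { buf[i] = c; return; }
        std::lock_guard<std::mutex> g(swapLock);
        flushToCoordinator();           // hand the full batch over
        head.store(0, std::memory_order_relaxed);
        push(c);                        // retry into the fresh buffer
    }
private:
    void flushToCoordinator() {}        // placeholder for the hand-off
};
```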

Dilemmas:

  • Maybe thread-management while searching (i.e. learnt clause cleaning and sharing) should be a different class from the one that performs simplification. The former would not need to have propagation at all. Separation of duties always leads to cleaner code.
  • Single-threaded solving is now only an afterthought, and it would be interesting to see how the solver performs in single-threaded mode. I hope it will be only slightly worse than if it didn’t have this architecture around it. It is interesting to note, though, that the structure of the solver will be much cleaner than huge-blob single-threaded solvers tend to be.
  • OpenMP will probably not suffice, as read-write locks are not available in it, and implication cache sharing is a very important part of the architecture. Fully locking that data structure when using it would incur too much overhead — most (99.9%) of the time we would only be reading it. This means the code might not be as portable as I wished. I think the read-write lock is the only thing missing for implementing all of this using OpenMP.
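
For illustration, here is a sketch of the implication cache guarded by a POSIX read-write lock (the primitive OpenMP lacks); the ImplCache layout and names are my guesses at the idea:

```cpp
#include <cstddef>
#include <pthread.h>
#include <vector>

typedef unsigned Lit;

class ImplCache {
    std::vector<std::vector<Lit> > cache;  // implied literals per literal
    pthread_rwlock_t lock;
public:
    ImplCache(size_t nLits) : cache(nLits) {
        pthread_rwlock_init(&lock, NULL);
    }
    ~ImplCache() { pthread_rwlock_destroy(&lock); }

    // The ~99.9% case: many concurrent readers, e.g. for on-the-fly
    // self-subsuming resolution with non-existent binary clauses.
    void read(Lit l, std::vector<Lit>& out) {
        pthread_rwlock_rdlock(&lock);
        out = cache[l];
        pthread_rwlock_unlock(&lock);
    }

    // Rare writer: only during propagation at decision level 1.
    void addImplied(Lit l, Lit implied) {
        pthread_rwlock_wrlock(&lock);
        cache[l].push_back(implied);
        pthread_rwlock_unlock(&lock);
    }
};
```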

Notes:

  • This architecture lets the designer easily change the propagation mechanism. Just re-write the lean class. I have actually done that, and I can have both move-to-front (does not work with sharing) and static clause propagation strategies. People could write their own with ease.
  • The conflict generation&search class should have an object pointer passed to it that is an instance of a class with an abstract search class as its parent. We could then change restart settings on-the-fly by passing different object pointers (see the sketch after this list).
  • We could save state (i.e. stop&restart search) easily by interrupting all threads and saving the clause database (+learnt clause cleaning state). Simplifications should be fully stateless, so no state would need to be saved for those.
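
A minimal sketch of that pluggable restart idea; RestartStrategy and GeometricRestart are invented names, and the geometric policy is just one example:

```cpp
class RestartStrategy {
public:
    virtual ~RestartStrategy() {}
    virtual bool shouldRestart(unsigned conflictsSinceRestart) = 0;
};

class GeometricRestart : public RestartStrategy {
    unsigned limit;
public:
    GeometricRestart() : limit(100) {}
    bool shouldRestart(unsigned conflicts) {
        if (conflicts < limit) return false;
        limit = (unsigned)(limit * 1.5);  // grow the restart interval
        return true;
    }
};

// The searcher only holds a RestartStrategy*, so the coordinating class
// can swap in a different strategy between restarts.
```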

Okay, this is as far as I have got. I have the lean class, the search&conflict generation class, the shared clause architecture, and the stateless simplifications ready, but the coordinating class is currently highly entangled with the conflict generation class, so those need to be separated. Further, I want to fix on-the-fly subsumption, because it seems to be an easy way to improve the efficiency of conflict generation: if we can shorten a non-learnt clause at every 10th conflict, that is essentially a surely good conflict clause that will never be cleaned. The effect of not having to clean such clauses is important when e.g. 6x more conflicts (for a 6-core CPU) are being generated per second than with a single-threaded solver.

On the New Watchlist Scheme

All right, so I am not so cool because I work on weekends, but I got this new watchlist scheme implemented. First off, let me say it was relatively easy to implement, which is a testament to the extensibility of CryptoMiniSat: it only took me ~4 hours to get it done. Secondly, I realised that I had forgotten about XOR clauses when I thought through the new scheme. I used to store the current value of the XOR clause at the XOR clause itself (using the signs of the literals, being the hacker I am), and update it at every call to propXorClause(). Since we need to keep XOR clauses static just like normal clauses, this is no longer possible, so a separate data structure needs to be built up for this. However, it won’t need too many updates, so all is fine, but it does make things a tad more complicated.

When first implemented, the new scheme performed worse than the normal one — much worse. We lost ~20% in terms of time. I never really give up on my ideas, however, so I started thinking about what the advantages of this new scheme are, and how we could make them work for us. First of all, since the watchlists are now more expensive to go through, we might as well spend more time at the clause, looking for a watch that is not likely to trigger again. In other words, when we are at the clause, looking for a new watch, don’t just settle for a literal that is either Undefined or True: strongly prefer to find one that is True. The idea here is that if we set the watch to be a literal that is True, during backtrack we might still have that literal be True (as we might not have undone that assignment), and so we won’t have to bother propagating it next time. This leads to more time going through the clause, but once the clause is already in the cache, going through it takes almost no time at all. This gained about ~5% of time: good, but not enough.
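A sketch of that preference; value() is the assumed assignment lookup, and the flat literal array with index-tracked watches is a simplification of the new scheme:

```cpp
enum lbool { l_True, l_False, l_Undef };
lbool value(int lit);   // assumed to be provided by the solver

// Returns the index of the new watch, or -1 (unit or conflict).
int findNewWatch(const int* lits, int size, int w1, int w2) {
    int undefPos = -1;
    for (int i = 0; i < size; i++) {
        if (i == w1 || i == w2) continue;   // skip the current watches
        lbool v = value(lits[i]);
        if (v == l_True) return i;          // best case: likely still True
                                            // after backtracking
        if (v == l_Undef && undefPos == -1) undefPos = i;  // fallback
    }
    return undefPos;
}
```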

For the next idea, we have to notice that we go through each clause from the beginning, and the literals there don’t change, in contrast to the scheme where we constantly move literals to the beginning of the clause. Almost every modern solver does something called polarity caching: we memorise the last polarity (True or False) of every variable, and then when branching on it next time, we take the cached polarity. In CryptoMiniSat there is a slight salt added, so with a low probability we take the opposite branch, but still: the cached polarity is a pretty good indicator of whether a literal will be True or False. If this is indeed the case, and we are preferably searching for literals that are True at the beginning of the clause… why not move the literals that have a True cached polarity to the beginning of the clause? So, I added a trivial (~20 line) algorithm that is executed very infrequently (~once per 100’000 conflicts) to do this. An interesting property of the new watchlist scheme is that this can be done without detaching and reattaching the clauses. Once implemented, the time to solve decreased by ~13%. Overall, then, the speed of the new watchlist scheme is close to the original.
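The reordering pass could be as simple as the following sketch; cachedPolarityIsTrue() is a hypothetical lookup into the polarity cache:

```cpp
#include <algorithm>
#include <vector>

bool cachedPolarityIsTrue(int lit);   // assumed polarity-cache query

void sortClauseByCachedPolarity(std::vector<int>& lits) {
    // Likely-True literals move to the front, where the watch search
    // looks first; no detach/reattach needed, as noted above.
    std::stable_partition(lits.begin(), lits.end(), cachedPolarityIsTrue);
}
```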

The advantages of this new scheme are manifold. Since we are only content when finding a True watched literal in the clause, we decrease the number of memory requests per thread, which lowers the strain on the memory architecture across all threads. Not to mention that we hardly touch the clauses at all, so write-back to memory is dramatically decreased, by probably 80% or more. And finally, we can share the clauses, leading to fewer overall memory fetches.

The only drawback of this new scheme seems to be that we cannot optimise for every thread: cached polarities are stored on a per-thread basis. However, one would expect that most of the time the cached polarities of the different threads are close. Therefore, this should work relatively well. Also, we could do voting for polarities among threads, and then sort literals in the clauses according to the votes cast — we would then optimise for the whole set of threads instead of only the master thread.

While doing the above, I have also played around with OProfile, a Linux tool that gives (among other things) an approximation of the number of L2 cache misses of a program. This helped me put the right prefetch builtins at the right places, and the number of cache misses is now half of what it was without prefetch instructions (for both watchlist schemes). CryptoMiniSat 2.9.0 had some prefetch instructions added at the very last moment, but they weren’t placed optimally. The new way they are organised is much better, leaving more time for the CPU to prefetch the lines (which can take many cycles).
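As an example of such a placement (a sketch, not the actual CryptoMiniSat code — the structures are simplified and the names invented): prefetch the watchlist of a literal the moment the literal is enqueued, so the line is in cache by the time propagation reaches it.

```cpp
struct Watchlist { /* vector of clause pointers, simplified away */ };
extern Watchlist* watches;   // assumed: indexed by literal

inline void onEnqueue(int lit) {
    // Non-blocking prefetch: the cacheline travels toward the CPU while
    // the solver finishes the current propagation step.
    __builtin_prefetch(&watches[lit ^ 1]);   // watchlist of ~lit
    // ... record the assignment, push lit onto the trail, etc.
}
```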

PS: The current architecture doesn’t allow CryptoMiniSat to share clauses, but even this way multi-threading is already faster by ~5% due to less memory write-back and less memory contention, thanks to the optimised watched literal picking.

Changing the Watchlist Scheme

Okay, what if we changed the watchlist scheme… a bit? The way the watchlist scheme operates in SAT solvers is the following. We have a set of literals lit1, lit2, lit3, lit4 that defines the clause we want to be watched for propagation and conflict. We take two literals from this set, and put them in a data structure. It’s not important which ones we take, but we have to know which ones we took. So, we take the 1st and the 2nd (because we are lazy), and we put them in a set of lists called the watchlists. We now have a pointer to the clause in watchlist[~lit1] and in watchlist[~lit2], where ~ is the binary negation. If we now set lit1 = false, we have to look at the clauses in watchlist[~lit1] and see if any of them forces a propagation. If a clause doesn’t, then we remove it from watchlist[~lit1], find another literal in it that is unset, e.g. lit3, and put the clause into watchlist[~lit3]. It is important that lit3 must not be lit1 or lit2. Obviously, we know what lit1 is (we are visiting its watchlist, after all), but how could we know what lit2 is? Well, we always update the clause such that lit1 and lit2 are the first two literals. We do this for every clause in every visited watchlist, until a conflict is found or there is nothing left to propagate.
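In code, the classic (MiniSat-style) version of what was just described looks roughly like this sketch; value(), neg(), enqueue() and the Clause layout are assumed helpers, and real code also handles binary clauses, blockers, and conflict analysis:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

enum lbool { l_True, l_False, l_Undef };
struct Clause { std::vector<int> lits; };
extern std::vector<std::vector<Clause*> > watchlist;  // per literal
lbool value(int lit);   // assumed current-assignment lookup
int  neg(int lit);      // assumed literal negation
void enqueue(int lit);  // assumed: set lit true, queue for propagation

// Literal p was just set true; returns false on conflict.
bool propagateWatches(int p) {
    std::vector<Clause*>& ws = watchlist[p];  // clauses watching ~p
    for (size_t i = 0; i < ws.size(); ) {
        Clause& c = *ws[i];
        // Keep the falsified watch at position 1 (this writes the clause!)
        if (c.lits[0] == neg(p)) std::swap(c.lits[0], c.lits[1]);
        if (value(c.lits[0]) == l_True) { i++; continue; }  // satisfied
        bool moved = false;
        for (size_t k = 2; k < c.lits.size(); k++) {
            if (value(c.lits[k]) != l_False) {    // found a new watch
                std::swap(c.lits[1], c.lits[k]);  // another clause write
                watchlist[neg(c.lits[1])].push_back(&c);
                ws[i] = ws.back(); ws.pop_back(); // unwatch here
                moved = true;
                break;
            }
        }
        if (moved) continue;
        if (value(c.lits[0]) == l_False) return false;  // conflict
        enqueue(c.lits[0]);                             // unit clause
        i++;
    }
    return true;
}
```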

So, why is this inefficient? Well, what needs to be noticed is that we update the clause non-stop in order to keep track of lit1 and lit2. We can implement this by changing the order of literals (that’s how it’s done in most cases), or by updating some data in the clause. Either way, this needs an update of the clause, and consequently of the whole cacheline on which the clause (and probably other clauses, as these are memory-packed) resides. Let’s say the average clause length is 10, but the maximum clause length is 2^16. Since a literal is 32 bits, an average clause, with all the extra info it needs to keep, is around 10*32+32+32 = 384 bits long. Cache line size for Core i7 processors is 64 Bytes (= 512 bits), though I believe the new architectures are bumping this to 128 Bytes. We are therefore asking the CPU to grab 64 Bytes from memory, store it in the CPU cache, and then put the changed value back into main memory — as the cache is woefully small to keep all the clauses we will visit. If you take a CPU with 6 cores and try to run this algorithm, you are probably looking at a memory bandwidth nightmare. The clauses cannot be shared, because updating the lit1 and lit2 positions is not an option, as these would be different for each thread. Furthermore, the CPU is not only taking data from main memory, but also putting data back into main memory at an astonishing rate.

Let’s take a Core i7 processor off the shelf and see what its architecture looks like. We have 1MByte of L2 cache for each core, and 8MByte of shared L3 cache for all the CPU cores. This is like a finger pointing at us: keep a big piece of shared memory that is constant and can be shared, and a much smaller piece of thread-local memory. Also: please lighten the load on the memory access sub-architecture, as that isn’t going to keep up with all 6 cores churning through the clauses. Well, an interesting solution is then to split each clause into two pieces: one that stores which literals are lit1 and lit2, and one that stores the literals and the extra information. The first piece, let’s call it threadContext, stores the positions of lit1 and lit2, each of which is 16 bits (enough to index the maximal clause size), so 4 bytes in total. The clause will then not need any update at all. An average problem instance is typically beaten down to a size of <300’000 clauses by CryptoMiniSat in <500 seconds. Optimally, we would like to access these clauses from main memory or L3 cache (this latter is a dream, really), and keep their threadContext in the L2 cache of the core. So, this latter is 300’000*4 = 1.2MByte for a typical problem. There are also some other things to keep in L2, such as the current variable assignment and the current trail of assignments, but each of these should be much smaller. The total would probably be ~2MByte. Oh well, we didn’t fit inside the 1MByte of L2 cache, but still. The net effect of this change should be positive. Furthermore, even if we only use one core, more memory bandwidth should be available, as we no longer move that much (or hardly any) information back into main memory.
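As a sketch, the split might look like this; the field names are my guesses at the idea, not an actual implementation:

```cpp
#include <stdint.h>

struct SharedClause {    // shared by all threads, never written in search
    uint16_t size;
    // uint32_t lits[];  // literals + extra info, packed after the header
};

struct ThreadContext {   // 4 bytes per clause, thread-local (aimed at L2)
    uint16_t watch1;     // position of the first watched literal
    uint16_t watch2;     // position of the second watched literal
};
// 300'000 clauses * 4 bytes = 1.2 MByte of thread-local context, as above.
```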

So why wouldn’t this idea work? Well, for starters, we would access two pieces of memory whenever visiting a clause: instead of only accessing the clause, we now also have to access threadContext to be able to work with the clause. I am willing to argue that this is not a problem at all. The reason is really simple: since threadContext is small, and since it is being accessed all the time, it will surely be in cache, probably L2, maybe L3. The clause itself, however, is almost sure to be in main memory, and if not, then in L3. All we need to do is emit a non-blocking memory-prefetch assembly instruction for both entities when we want to treat the clause, and then work with the clause. Since threadContext is almost sure to arrive at the CPU earlier than the clause itself, and since these fetches can be done in parallel, the speed difference will be either negligible or nonexistent. The other argument we could come up with relates to the following design feature: swapping the literals lit1 and lit2 to the head of the clause allows the CPU to travel through less of the clause and carry out fewer instructions (less checking for lit1&lit2). This could mean fewer cachelines needed and fewer instructions carried out. Since memory is the bottleneck, the number of instructions is not of any real concern. More cachelines are more of a problem, but if we access the right literal at the right point in the clause, then parts of the clause might not be needed, and we would only need to grab more cachelines than with the original scheme if we happen to roll around the clause: starting in the middle, reaching the end without finding a new unassigned literal, and having to go back to the beginning. But this shouldn’t occur too often, and the clause might be on the same cacheline anyway (with <50% chance for a 10-long clause and a 128Byte cacheline).
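A sketch of that double prefetch, assuming the ThreadContext/SharedClause split from above and GCC’s __builtin_prefetch:

```cpp
struct ThreadContext;   // the 4-byte watch record sketched earlier
struct SharedClause;    // the shared, read-only clause body

inline void prefetchBoth(const ThreadContext* ctx, const SharedClause* c) {
    __builtin_prefetch(ctx);   // tiny and hot: likely already in L2
    __builtin_prefetch(c);     // big and cold: likely in main memory or L3
    // ... meanwhile, process other queued work; by the time we touch
    // ctx and c, both lines should have arrived in parallel.
}
```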

The last, and most burning question is: will this ever be implemented? Well, if we are really to scale to 6- and 12-core CPUs, I think eventually a better way to share cache will be needed. This is just one way of doing it — probably a crazy complicated and inefficient way at that. Only a good implementation can tell if this approach is worthwhile to pursue, and the 2011 SAT Competition is far too close for me to get this right. Oh well, let’s get back to removing that “last” bug in CryptoMiniSat 2.9.0….

Edited to Add (21/01/2011): I forgot that watchlists also need to be in cache. However, most of the watchlists needed can be prefetched: when a literal is propagated, we can prefetch the corresponding watchlist immediately. The only problem is when a watchlist is needed because a clause’s watched literal is changed. This poses a headache that I haven’t yet thought through.