
Dreaming up a parallel SAT architecture

When you look at the speed of solving with parallel SAT solvers, it is easy to see that scaling doesn’t really work, especially for the hard UNSAT problems. In the past months I have been struggling to think of an architecture that will enable me to write a SAT solver that solves this problem. Let me share with you where I have got to. This will be more brainstorming than a descriptive blog post, but maybe it will help you avoid pitfalls, or simply give you a better picture of parallel SAT solving.

The constraints:

  • Clauses must be shared… all the time. Not “when it’s good enough” as per ManySat; that architecture doesn’t work well for UNSAT problems, and a fully shared scheme should be able to overtake a ManySat-like architecture with ease. Sharing only low-glue or high-activity clauses does not work. This is easy to test: write a single-threaded SAT solver that detaches high-glue learnt clauses on backtracking. It will work decidedly worse, even though those clauses would be removed anyway at the next clause-database cleaning. The reason? I don’t know… probably we need those clauses in our current local search (i.e. the region of the search space where the variable activities are similar).
  • Clauses need to have glues and activities. Glues could be updated dynamically, but the gain is small and the burden is a bit too much, so make glue static. Activities (which are inherently dynamic) are important though (see CryptoMiniSat), and they must be lock-free, as they need to be updated frequently; a minimal sketch of such an update follows this list.
  • Memory is scarce when implication caching is used. Implication cache must be shared and updated across the parallel solvers. This not only lowers memory usage, but is an important way to share information other than clauses.
  • Parallel solvers must be lean, otherwise the burden of programming a correct solver would be far too much
  • Simplifications must be carried out regularly, which needs the coordinated stopping / interruption of all parallel solvers
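
Here is a minimal sketch of what a lock-free activity bump could look like, assuming C++11 atomics; the types and names are illustrative, not CryptoMiniSat’s actual code. Relaxed memory ordering is enough because activities are heuristic values, so an occasionally stale read is harmless.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative per-clause statistics: glue is static (set once), while
// activity may be bumped concurrently by any thread without a lock.
struct SharedClauseStats {
    explicit SharedClauseStats(uint32_t g) : glue(g) {}
    const uint32_t glue;                 // static, per the constraint above
    std::atomic<uint32_t> activity{0};   // dynamic, updated lock-free
};

inline void bumpActivity(SharedClauseStats& c, uint32_t bump = 1) {
    // Relaxed ordering: no other data is synchronised through this value.
    c.activity.fetch_add(bump, std::memory_order_relaxed);
}
```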

My architectural solution:

  • Make a lean class that only handles: clause attachment, detachment and propagation. I will need to add on-the-fly hyper-binary resolution and on-the-fly useless binary clause removal as well, as those are tightly coupled to propagation
  • Make a class that handles search & conflict generation. This class has the lean class as its parent. It generates conflict clauses and has its own variable and clause activities. After every conflict it generates, the class checks for new conflict clauses from other solvers, and attaches those clauses when doing so doesn’t need trail rewriting, though this latter could be made optional. This class does not store a list of clauses. The tricky part of this class is on-the-fly subsumption. We will handle that by not shortening the clause but instead adding a new one and detaching the old one; the thread-coordinating class will handle propagating this update to all other threads. Checks for interruption are done after every generated conflict.
  • A thread-coordinating class that handles clause storage, clause distribution, and simplification. This class stops all threads if one thread finds SAT or UNSAT, and interrupts them if clause cleaning needs to be carried out. This class also has the lean class as its parent, since e.g. clause vivification or variable replacement needs propagation. A skeletal sketch of these three classes follows this list.
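
Below is a skeletal sketch of how the three classes could relate, including the interrupt flag described under the data structures further down; all names and stub bodies are illustrative, not CryptoMiniSat’s actual API.

```cpp
#include <atomic>

class PropEngine {                       // the lean class
public:
    virtual ~PropEngine() = default;
    bool propagate() { return true; }    // would also do on-the-fly
                                         // hyper-binary resolution etc.
};

class Searcher : public PropEngine {     // search & conflict generation
public:
    explicit Searcher(std::atomic<bool>& stop) : interrupt(stop) {}
    void search() {
        while (!interrupt.load(std::memory_order_relaxed)) {
            propagate();
            // ... analyse the conflict, learn a clause ...
            importForeignClauses();      // after every generated conflict
        }
    }
private:
    void importForeignClauses() {}       // attach only when no trail
                                         // rewriting is needed
    std::atomic<bool>& interrupt;        // set by the coordinator
};

class Coordinator : public PropEngine {  // storage, distribution, simplification
public:
    std::atomic<bool> interrupt{false};  // the interrupt-handler flag
    void stopAll() { interrupt.store(true); }
};
```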

Data structures:

  • Fully shared clauses. No move-to-front strategy; instead, literal sorting is carried out once in a while (during simplification, I would assume). This has already been implemented, and is only about 10% slower than non-shared clauses in single-threaded mode.
  • Interrupt handler. This is just a flag set by the coordinating solver. The subsolvers check this flag after every conflict generated, and either exit (if SAT/UNSAT is found) or just stop and mark each learnt clause locked (if cleaning the learnt clause database)
  • Implication cache. Needs read-write lock, as most of the time it will be read to do on-the-fly self-subsuming resolution with non-existent binary clauses. Can only have literals added, not removed. Removal and updating of literals is performed only during simplification. Updating (i.e. writing) is done by threads only when doing propagation at decision level 1.
  • Distribution of newly learnt clauses. A ‘lock-free’ (a term I dislike, as it is misleading: there are locks) datastructure for each thread in the thread-coordinating class that allows the storage of a fixed number of clauses, accessible and writeable without locks until the buffer is filled, at which point we lock and update pointers. A sketch follows this list.
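
Here is a minimal sketch of what such a distribution buffer could look like: one writer (the owning thread), any number of readers each keeping its own cursor, and no locking until the buffer fills. All names are illustrative, and the locked slow path that swaps in a fresh buffer is elided.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <vector>

using Lit = int32_t; // illustrative literal type

class ClauseExportBuf {
public:
    static const size_t kCap = 4096;

    // Called by the owning thread only. Returns false when the buffer is
    // full, i.e. when the caller must take the locked slow path instead.
    bool push(const std::vector<Lit>& clause) {
        const size_t n = count.load(std::memory_order_relaxed);
        if (n == kCap) return false;
        slots[n] = clause;                             // fill the slot first,
        count.store(n + 1, std::memory_order_release); // then publish it
        return true;
    }

    // Called by any reader thread; readPos is that reader's own cursor.
    // Published slots are never modified again, so no lock is needed.
    const std::vector<Lit>* next(size_t& readPos) const {
        if (readPos >= count.load(std::memory_order_acquire))
            return nullptr;                            // nothing new yet
        return &slots[readPos++];
    }

private:
    std::array<std::vector<Lit>, kCap> slots;
    std::atomic<size_t> count{0};
};
```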

Dilemmas:

  • Maybe thread-management while searching (i.e. learnt clause cleaning and sharing) should be a different class from the one that performs simplification. The former would not need to have propagation at all. Separation of duties always leads to cleaner code.
  • Single-threaded solving is now only an afterthought, and it would be interesting to see how the solver performs in single-threaded mode. I hope it will be only slightly worse than if it didn’t have this architecture around it. It is interesting to note, though, that the structure of the solver will be much cleaner than huge-blob single-threaded solvers tend to be.
  • OpenMP will probably not suffice, as read-write locks are not available, and implication cache sharing is a very important part of the architecture. Fully locking that datastructure when using it would incur too much overhead: most (99.9%) of the time we would only be reading it. This means the code might not be as portable as I wished it would be. I think the read-write lock is the only thing I am missing to implement all of this using OpenMP; a sketch of the read-mostly access pattern follows this list.
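
For illustration, here is the read-mostly pattern with a read-write lock. The class and method names are made up; std::shared_mutex is C++17, and pthread_rwlock_t would be the equivalent outside the standard library (and exactly the piece OpenMP lacks).

```cpp
#include <cstdint>
#include <map>
#include <shared_mutex>
#include <vector>

using Lit = int32_t; // illustrative literal type

class ImplCache {
public:
    // Hot path (~99.9% of accesses): many threads may read concurrently.
    std::vector<Lit> lookup(Lit l) const {
        std::shared_lock<std::shared_mutex> rd(mtx);
        auto it = cache.find(l);
        return it == cache.end() ? std::vector<Lit>() : it->second;
    }

    // Rare path: a thread propagating at decision level 1 adds a literal.
    // Literals are only added here; removal happens during simplification.
    void addImplied(Lit l, Lit implied) {
        std::unique_lock<std::shared_mutex> wr(mtx);
        cache[l].push_back(implied);
    }

private:
    mutable std::shared_mutex mtx;
    std::map<Lit, std::vector<Lit>> cache;
};
```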

Notes:

  • This architecture lets the designer easily change the propagation mechanism. Just re-write the lean class. I have actually done that, and I can have both move-to-front (does not work with sharing) and static clause propagation strategies. People could write their own with ease.
  • The conflict generation & search class should be passed an object pointer that is an instance of a class with an abstract search class as its parent. We could then change restart settings on the fly by passing different object pointers; a small sketch of this follows.
  • We could save state (i.e. stop&restart search) easily by interrupting all threads and saving the clause database (+learnt clause cleaning state). Simplifications should be fully stateless, so no state would need to be saved for those.
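
A minimal sketch of that strategy-object idea, with made-up names:

```cpp
#include <memory>

// The searcher only ever sees the abstract type, so swapping the pointer
// changes restart behaviour on the fly.
struct RestartStrategy {
    virtual ~RestartStrategy() = default;
    virtual bool shouldRestart(unsigned conflictsSinceRestart) = 0;
};

struct GeometricRestart : RestartStrategy {
    explicit GeometricRestart(unsigned first) : limit(first) {}
    bool shouldRestart(unsigned conflicts) override {
        if (conflicts < limit) return false;
        limit = limit * 3 / 2;           // grow the restart interval
        return true;
    }
    unsigned limit;
};

// Inside the searcher, only the abstract type appears:
//   std::unique_ptr<RestartStrategy> restart(new GeometricRestart(100));
//   if (restart->shouldRestart(conflicts)) { /* backtrack to level 0 */ }
```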

Okay, this is as far as I have got. I have the lean class and the search & conflict generation class, the shared clause architecture, and the stateless simplifications ready, but the coordinating class is currently highly entangled with the conflict generation class, so those need to be separated. Further, I want to fix on-the-fly subsumption, because it seems an easy way to improve the efficiency of conflict generation: if we can shorten a non-learnt clause at every 10th conflict, that is essentially a conflict clause that is guaranteed to be good and will never be cleaned. Not having to clean clauses matters when e.g. 6x more conflicts (on a 6-core CPU) are being generated per second than with a single-threaded solver.

Hyper-binary resolution: I was wrong, again

I am not perfect, so I make mistakes. One of the more memorable mistakes I have made has been regarding hyper-binary resolution. More specifically, in this post I wrote that I could not add as many binary clauses using hyper-binary resolution as Armin Biere, one of the leading SAT solver experts, and I proposed a reason that would have meant that CryptoMiniSat was doing things differently and thus better. Well, I was wrong, on multiple counts.

First of all, there was an awful bug in the code that meant that hyper-binary resolution was not carried out on negated variables. Second, when multiple implications had to be attached to a literal, I didn’t check for redundancy. For example, if v1 had to be connected to v2 and v3, I simply added the clauses
-v1 OR v2 (1)
-v1 OR v3 (2)
However, this may not be very efficient, since if there is already a binary clause
-v2 OR v3 (3)
then clause (2) doesn’t need to be added: v1 implies v2, which already implies v3. The added redundant binary clauses reduced the speed of solving while adding no extra knowledge. There were many of them, too: approximately two-thirds of all binary clauses added were redundant.
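
For illustration, here is a sketch of the missing redundancy check: before adding the hyper-binary clause (-v1 OR v3), test whether v3 is already reachable from v2 (an implication we kept) through existing binary clauses. The data structure and names are assumptions, not the actual CryptoMiniSat code.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Lit = int32_t; // illustrative: the negation of v is -v

// Depth-first search over the binary implication graph: binImpl maps a
// literal to the literals it implies through existing binary clauses.
bool reachableViaBinaries(Lit from, Lit target,
                          const std::map<Lit, std::vector<Lit>>& binImpl) {
    std::set<Lit> seen;
    std::vector<Lit> todo(1, from);
    while (!todo.empty()) {
        const Lit l = todo.back();
        todo.pop_back();
        if (l == target) return true;
        if (!seen.insert(l).second) continue;  // already visited
        auto it = binImpl.find(l);
        if (it != binImpl.end())
            todo.insert(todo.end(), it->second.begin(), it->second.end());
    }
    return false;
}
```

In the example above, binImpl would map v2 to v3 because of clause (3), so reachableViaBinaries(v2, v3, binImpl) returns true and clause (2) is skipped.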

Hyper-binary resolution is conceptually not too difficult, but takes a lot of thinking to code efficiently, is very time consuming and its benefits are not clear-cut. I believe the problem is that many binary clauses added this way are ultimately useless, since most are connected to variables that will soon be proved to be of a fixed value. Another possibility is that since problems are pretty structured, and it’s usually best to attack problems in a specific way (which is normally correctly guessed by the VSIDS and polarity-guessing&caching heuristics), the binary clauses added by hyper-binary resolution do not help resolving the problem given the (typically correct) attack method employed by SAT solvers. In other words, the information added is nice, but mostly unused. This is just wild speculation, and I may only think this because my code is slow — I believe Armin’s code is faster, so I should have a look.

Wasting cluster time

Today marked the 100th cluster experiment I have conducted on the Grid’5000 cluster in Grenoble. The computers in this cluster are very powerful beasts, each equipped with two Xeon Core i7-based processors of 4 cores each, so 8 cores in total, with impressive memory size and connectivity. The sad part is that since CryptoMiniSat is sequential software, I only use 1 core out of the 8, since I cannot afford fluctuating test results due to memory-bandwidth problems when running multiple instances of the solver on the same physical computer. I have just counted, and using a very conservative estimate, I have wasted 41.8 years of CPU core time to get CryptoMiniSat up to speed.

Was it worth wasting such an immense amount of CPU core time to speed up a SAT solver? I think yes. First of all, I only used the cluster during the night and at weekends, when load was negligible, so in a way I wasn’t wasting much apart from my own time launching the cluster experiments and collecting and analysing the results. Secondly, the feedback from the cluster immensely influenced my research and engineering decisions, and taught me many things about SAT. The most important lesson: never trust your intuitions when it comes to SAT. The second lesson: good ideas on paper rarely turn out to be good ideas when first implemented, but with heavy tuning they usually end up working well. Let me detail two examples.

Clause vivification (I called it asymmetric branching) is a relatively simple example of why tuning is always necessary. This optimisation shortens the clauses it is fed; however, it is slow. The idea I used when tuning it is that the longest clauses are probably responsible for slow solving and for the long generated conflict clauses. So I sort clauses according to size first, and then vivify the topmost (i.e. longest) clauses. Furthermore, I realised that enqueueing literals 2-by-2 on the stack (instead of 1-by-1) saves significant time and is almost as good: the difference is at most 1 literal in the final simplified clause.
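
A sketch of that loop, with the solver hooks reduced to no-op stubs; all names are illustrative, not CryptoMiniSat’s.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Lit = int32_t;

static bool enqueueNegated(Lit) { return true; } // assume: puts the negation
                                                 // of the literal on the trail
static bool propagateOk()       { return true; } // assume: false on conflict
static void backtrack()         {}               // assume: undoes assumptions

static void vivify(std::vector<Lit>& cl) {
    std::vector<Lit> kept;
    for (size_t i = 0; i < cl.size(); i += 2) {
        // Enqueue 2-by-2: nearly as strong as 1-by-1 (at most one extra
        // literal survives in the shortened clause), but much faster.
        enqueueNegated(cl[i]);
        kept.push_back(cl[i]);
        if (i + 1 < cl.size()) {
            enqueueNegated(cl[i + 1]);
            kept.push_back(cl[i + 1]);
        }
        if (!propagateOk()) { // conflict: the rest of the clause is
            cl = kept;        // unnecessary, so shorten it
            break;
        }
    }
    backtrack();
}

static bool longerFirst(const std::vector<Lit>& a, const std::vector<Lit>& b) {
    return a.size() > b.size();
}

// Sort longest-first and vivify only the topmost clauses.
void vivifyLongest(std::vector<std::vector<Lit>>& cls, size_t topN) {
    std::sort(cls.begin(), cls.end(), longerFirst);
    for (size_t i = 0; i < cls.size() && i < topN; i++)
        vivify(cls[i]);
}
```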

A more complex example is that of speeding up self-subsuming resolution. The problem was that keeping all clauses attached in the watchlists slowed down self-subsuming resolution. However, the obvious solution of detaching all clauses and reattaching them after self-subsuming resolution was, strangely enough, incomplete. Multiple cluster experiments made it apparent that keeping clauses attached is important, because some propagations are made while the algorithm runs, and propagating these (very few!) variables is crucial. The solution was to keep the natively stored clauses (2- and 3-long) in the watchlists. These natively kept clauses have no pointers in the watchlists, so their representation at the pointer’s location can be updated in place. The watchlists are therefore partially cleaned, then self-subsuming resolution is carried out, then the watchlists are fully cleaned, and finally they are fully repopulated: not exactly a trivial algorithm, but it achieves both goals of fast and complete self-subsuming resolution.
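
To make the ordering explicit, here is the four-phase sequence as a sketch; every name is a made-up stub, and the point is only the order of operations and what stays attached in between.

```cpp
static void removeLongClausePointers() {}   // partial clean: 2- and 3-long
                                            // clauses stay, stored natively,
                                            // so propagation remains sound
static void selfSubsumingResolution() {}    // may propagate a few variables,
                                            // which the native clauses catch
static void clearWatchlists() {}            // full clean
static void reattachAllClauses() {}         // repopulate the watchlists

void selfSubsumePhases() {
    removeLongClausePointers();
    selfSubsumingResolution();
    clearWatchlists();
    reattachAllClauses();
}
```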


Off-topic: For those who like contemporary art, here is a gem

Documenting CryptoMiniSat

Documenting code is not always much fun. However, for the code to be extended by others, documentation needs to exist. Since I conceived CryptoMiniSat as a program to be extended by others, documentation was a must. So now, after about 2 weeks of work, the documentation is finally ready. All major classes have been documented, along with all major functions and internal data structures. The number of comment lines I added is around a thousand, all in the Doxygen format. A preliminary HTML version is available here. I hope the quality of the documentation will improve with time, and that others might correct and add more documentation as they update the program.
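
For illustration, this is roughly what the Doxygen format looks like; the function and parameter below are made up, only the markup is real.

```cpp
struct Clause; // illustrative type

/**
@brief Attach a clause to the watchlists (illustrative example only)

Doxygen turns comments like this one into the HTML documentation.

@param[in] clause  the clause to be attached
@return true if the clause was successfully attached
*/
bool attachClause(const Clause& clause);
```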

While documenting the code, it occurred to me that certain variable and function names were really awkward, or reflected the state of the class from an earlier version of the code. These variables and functions have been renamed, and some even removed, since they served no purpose other than making the code bigger for no reason. I also found a number of TODOs while going through the code: sometimes I implemented things the fast way instead of the correct way, so some data structures are really strange, slow, or both. The source code with all the documentation is available here, from the gitorious code repository. I will soon make a 2.6.1 release that contains not only this newly added documentation, but other additions as well.