If you look at how fast parallel SAT solvers actually solve instances, it is easy to see that the scaling doesn't really work, especially on hard UNSAT problems. In the past months I have been struggling to come up with an architecture that would let me write a SAT solver that overcomes this. Let me share where I have got so far. This will be more of a brainstorm than a descriptive blog post, but maybe it will help you avoid some pitfalls, or simply give you a better picture of parallel SAT solving.
The constraints:
- Clauses must be shared… all the time. Not “when it’s good enough” as in ManySat: that architecture doesn’t work well for UNSAT problems, and a fully shared scheme should be able to overtake a ManySat-like architecture with ease. Sharing only low-glue or high-activity clauses does not work. This is easy to test: write a single-threaded SAT solver that detaches high-glue learnt clauses on backtracking. It will perform decidedly worse, even though those clauses would be removed anyway at the next clause cleaning. The reason? I don’t know… probably we need those clauses for our current local (= similar variable activities) search.
- Clauses need to have glues and activities. Glues could be updated dynamically, but the gain is not that large and the burden is a bit too much, so make them static. Activities (which are inherently dynamic) are important though (see CryptoMiniSat), and they must be bumped lock-free, as they are updated very frequently (see the sketch after this list).
- Memory is scarce when implication caching is used. Implication cache must be shared and updated across the parallel solvers. This not only lowers memory usage, but is an important way to share information other than clauses.
- Parallel solvers must be lean, otherwise the burden of programming a correct solver would be far too much
- Simplifications must be carried out regularly, which needs the coordinated stopping / interruption of all parallel solvers
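To make the activity constraint concrete, here is a minimal sketch of what a lock-free bump could look like. Everything here (the struct, the field names) is invented for illustration, not actual solver code:

```cpp
#include <atomic>
#include <cstdint>

// Glue is static per the constraint above, so a plain integer suffices.
// The activity is bumped concurrently by all threads, hence the atomic;
// fetch_add on atomic<float> needs C++20, so a portable
// compare-exchange loop is used instead.
struct SharedClauseStats {
    uint32_t glue = 0;                  // set once at learning time
    std::atomic<float> activity{0.0f};

    void bump(float inc) {
        float cur = activity.load(std::memory_order_relaxed);
        // Retry until our increment lands on top of the latest value;
        // on failure, compare_exchange_weak reloads 'cur' for us.
        while (!activity.compare_exchange_weak(cur, cur + inc,
                                               std::memory_order_relaxed)) {
        }
    }
};
```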
My architectural solution:
- Make a lean class that only handles clause attachment, detachment and propagation. I will need to add on-the-fly hyper-binary resolution and on-the-fly useless binary clause removal as well, as those are tightly coupled to propagation
- Make a class that handles search & conflict generation. This class has the lean class as its parent. It generates conflict clauses and has its own variable & clause activities. After every conflict it generates, it checks for new conflict clauses from the other solvers, and attaches those clauses whenever doing so doesn’t require trail-rewriting (though this latter could be made optional). This class does not store a list of clauses. The tricky part of this class is on-the-fly subsumption. We will handle that by not shortening the clause but instead adding a new one and detaching the old one; the thread-coordinating class will handle propagating this update to all other threads. Checks for interruption are done after every generated conflict.
- A thread-coordinating class that handles clause storage, clause distribution, and simplification. This class stops all threads if UNSAT or SAT is found by one thread, and interrupts them if clause-cleaning needs to be carried out. This class also has the lean class as its parent, since e.g. clause vivification and variable replacement need propagation. A rough skeleton of this three-class split is sketched below.
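The skeleton might look something like this; every name here (PropEngine, Searcher, Coordinator, lbool) is invented for illustration:

```cpp
#include <cstdint>

enum class lbool { True, False, Undef };

class PropEngine {                 // the lean class
public:
    void attachClause(uint32_t clause_num);
    void detachClause(uint32_t clause_num);
    bool propagate();              // incl. on-the-fly hyper-binary resolution
};

class Searcher : public PropEngine {
public:
    // CDCL search; after every generated conflict it imports clauses
    // from the other threads and checks the coordinator's interrupt flag.
    lbool search();
private:
    // Per-thread variable & clause activities live here, but no clause list.
};

class Coordinator : public PropEngine {
public:
    // Owns clause storage & distribution; needs propagation itself,
    // e.g. for clause vivification and variable replacement.
    lbool solve(unsigned num_threads);
};
```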
Data structures:
- Fully shared clauses. No move-to-front strategy; instead, literal-sorting is carried out once in a while (during simplification, I would assume). This has already been implemented, and is only about 10% slower than non-shared clauses in single-threaded mode.
- Interrupt handler. This is just a flag set by the coordinating solver. The subsolvers check this flag after every conflict generated, and either exit (if SAT/UNSAT is found) or just stop and mark each learnt clause locked (if cleaning the learnt clause database)
- Implication cache. Needs read-write lock, as most of the time it will be read to do on-the-fly self-subsuming resolution with non-existent binary clauses. Can only have literals added, not removed. Removal and updating of literals is performed only during simplification. Updating (i.e. writing) is done by threads only when doing propagation at decision level 1.
- Distribution of newly learnt clauses. A ‘lock-free’ (how I dislike that term; it’s misleading, it does have locks) data structure for each thread in the thread-coordinating class that stores a pre-fixed number of clauses, accessible & writeable without locks until the buffer is filled (at which point we lock & update pointers). A sketch of this, together with the interrupt flag, follows.
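Here is an illustrative version of the last two structures (all names invented). The interrupt flag is polled after every conflict; the export buffer is written without any lock until it is full, and only then does the owning thread lock and hand the batch over:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct InterruptFlag {
    std::atomic<bool> stop{false};        // set by the coordinating class
    bool isSet() const { return stop.load(std::memory_order_relaxed); }
};

class ClauseExportBuf {
public:
    explicit ClauseExportBuf(std::size_t capacity) : buf(capacity) {}

    // Fast path, called by the owning searcher thread only: no lock
    // is taken until the pre-fixed capacity is reached.
    void push(uint32_t clause_num) {
        if (fill == buf.size()) flush();  // rare slow path
        buf[fill++] = clause_num;
    }

private:
    void flush() {
        // The text above speaks of locking & updating pointers; a copy
        // keeps the sketch short. The coordinating class takes the same
        // mutex when it drains 'exported'.
        std::lock_guard<std::mutex> guard(handover);
        exported.insert(exported.end(), buf.begin(), buf.begin() + fill);
        fill = 0;
    }

    std::vector<uint32_t> buf;            // pre-fixed capacity
    std::size_t fill = 0;
    std::mutex handover;
    std::vector<uint32_t> exported;       // read by the coordinating class
};
```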
Dilemmas:
- Maybe thread-management while searching (i.e. learnt clause cleaning and sharing) should be a different class from the one that performs simplification. The former would not need to have propagation at all. Separation of duties always leads to cleaner code.
- Single-threaded solving is now only an afterthought, and it would be interesting to see how the solver performs in single-threaded mode. I hope it will be only slightly worse than if it didn’t have this architecture around it. It is interesting to note, though, that the structure of the solver will be much cleaner than huge-blob single-threaded solvers tend to be.
- OpenMP will probably not suffice, as it has no read-write locks, and implication cache sharing is a very important part of the architecture. Fully locking that data structure on every use would incur too much overhead: most (99.9%) of the time we will only be reading it. This means the code might not be as portable as I wished it would be. I think the read-write lock is the only thing I am missing to implement all of this using OpenMP. For reference, a sketch of the locking pattern is below.
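This is the pattern the read-write lock would give us, sketched here with C++17's std::shared_mutex (a pthreads rwlock would do equally well). The cache layout is invented; only the locking pattern matters: readers proceed in parallel, writers are exclusive:

```cpp
#include <algorithm>
#include <cstdint>
#include <shared_mutex>
#include <vector>

struct ImplCache {
    // implied[l] = literals known to be implied by literal l
    // (literals assumed encoded as non-negative indices)
    std::vector<std::vector<uint32_t>> implied;
    mutable std::shared_mutex lock;

    // The 99.9% case: many threads may read simultaneously.
    bool implies(uint32_t lit, uint32_t other) const {
        std::shared_lock<std::shared_mutex> rd(lock);
        const auto& v = implied[lit];
        return std::find(v.begin(), v.end(), other) != v.end();
    }

    // The rare case: writing, done only when propagating at decision level 1.
    void add(uint32_t lit, uint32_t other) {
        std::unique_lock<std::shared_mutex> wr(lock);
        implied[lit].push_back(other);
    }
};
```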
Notes:
- This architecture lets the designer easily change the propagation mechanism. Just re-write the lean class. I have actually done that, and I can have both move-to-front (does not work with sharing) and static clause propagation strategies. People could write their own with ease.
- The conflict generation & search class should have an object pointer passed to it that is an instance of a class derived from an abstract search class. We could then change restart settings on-the-fly by passing different object pointers (sketched after this list).
- We could save state (i.e. stop&restart search) easily by interrupting all threads and saving the clause database (+learnt clause cleaning state). Simplifications should be fully stateless, so no state would need to be saved for those.
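The pluggable-search note could look roughly like this; all names are invented:

```cpp
#include <cstdint>

// The searcher holds a RestartStrategy* and can be handed a different
// concrete instance on the fly to change restart behaviour.
class RestartStrategy {
public:
    virtual ~RestartStrategy() = default;
    virtual bool shouldRestart(uint64_t conflictsSinceRestart) = 0;
};

class GeometricRestart : public RestartStrategy {
public:
    bool shouldRestart(uint64_t conflicts) override {
        if (conflicts < limit) return false;
        limit = static_cast<uint64_t>(limit * 1.5);   // grow the interval
        return true;
    }
private:
    uint64_t limit = 100;
};
```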
Okay, this is as far as I have got. I have the lean class, the search & conflict generation class, the shared clause architecture, and the stateless simplifications ready, but the coordinating class is currently highly entangled with the conflict generation class, so those need to be separated. Further, I want to fix on-the-fly subsumption, because it seems to be an easy way to improve the efficiency of conflict generation: if we can shorten a non-learnt clause at every 10th conflict, that is essentially a surely good conflict clause that will never be cleaned. Not having to clean such clauses matters when e.g. 6x more conflicts per second (on a 6-core CPU) are being generated than with a single-threaded solver.
Comments
I’m looking forward to seeing how this develops. Is this on one of the branches?
Sharing activities seems like a bold decision :) (letting threads interfere with each other’s black magic).
Good point, but I actually wanted to have activities for each clause stored locally at each thread. There is already a piece of information associated with each clause for each thread (the locations of the watches), so each clause already has a unique number, which can be used as an index into the thread-local activities. So the overhead won’t be too much. Naturally, I would never try to create race conditions ;) As for learnt clause cleaning, these per-thread activities would have to be merged somehow, which may not be trivial at all. I can think of reasons to use both max() and sum(), for example… Any ideas welcome :)
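In rough C++, the idea in this comment might look like this (all names invented):

```cpp
#include <cstdint>
#include <vector>

// Each thread keeps its own activity array, indexed by the clause's
// unique number, so bumps are race-free without any synchronisation.
struct ThreadLocalActivities {
    std::vector<float> act;     // act[clause_num], touched by this thread only

    void bump(uint32_t clause_num, float inc) {
        if (clause_num >= act.size()) act.resize(clause_num + 1, 0.0f);
        act[clause_num] += inc;
    }
};

// At learnt clause cleaning the per-thread arrays must be merged somehow;
// as the comment says, both max() and sum() over the threads' values are
// plausible candidates.
```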
You could put a reference count in each clause that is updated whenever the clause is attached/detached in a specific thread.
Only clauses which have reference count 0 can be cleaned, so effectively the clause is deleted when no thread wants to use it anymore.
(All this assuming that clauses can be attached/detached in individual threads.)
If you want to keep using a clause allocator like MiniSAT and CryptoMiniSAT do, I think each thread should have its own allocator, but allow other threads to access clauses allocated by a different thread. So when doing reduceDB() in a single thread, it should just skip over clauses with non-zero reference counts (modulo the “owning” thread).
Cleaning then becomes a two-stage process:
1. Going through the attached clauses, detaching the ones we don’t want to use anymore (like reduceDB() does in MiniSAT)
2. Going through the clauses allocated by this thread, deleting (i.e. freeing) the ones that are not used by any thread.
Cache invalidation due to updating of reference counts should be minimal since it is only updated on attach()/detach().
In the worst case, at least one thread wants to keep every clause, and this could mean that we use more memory, however, I don’t think it’s a problem in practice.
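A sketch of this reference-counting scheme (field and function names invented):

```cpp
#include <atomic>

// The count is bumped on attach() and dropped on detach(); only the
// thread whose allocator owns a clause may free it, and only once no
// thread has it attached any more.
struct Clause {
    std::atomic<int> refs{0};
    unsigned owner_thread = 0;   // whose allocator holds the memory
    // ... literals, glue, etc.
};

void attach(Clause& c) { c.refs.fetch_add(1, std::memory_order_relaxed); }
void detach(Clause& c) { c.refs.fetch_sub(1, std::memory_order_acq_rel); }

// Stage 1 (any thread): detach the clauses it no longer wants, reduceDB()-style.
// Stage 2 (owning thread): free its own clauses that nobody uses any more.
bool canFree(const Clause& c, unsigned my_thread) {
    return c.owner_thread == my_thread &&
           c.refs.load(std::memory_order_acquire) == 0;
}
```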
Hi Vegard,
Good idea (as usual!), but some details are needed. Cleaning clauses is glue-based (i.e. static, not dynamic), so if a thread wants to delete a clause, the others would like to do so as well. You seem to be talking about an architecture where this doesn’t hold. That would probably be a better architecture, though — I am just saying that the current arch is not like that.
Second, the memory architecture of CryptoMS & MiniSat uses a contiguously allocated piece of memory (well, for CryptoMS not really, but almost), so using multiple threads with multiple allocators would make things a bit more complicated. Then again, in CryptoMS they are already this complicated, so adding one more layer isn’t a big deal ;) In any case, you have to think about this extra layer of indirection, an indirection that will be in (level-1) cache all the time, so it can be safely ignored, but anyway, just saying.
What you said would be nice though if you managed to get a dynamic clause activity scheme working. At the moment, there are two possibilities: glues (static) and MiniSat-type scores (dynamic). However, solvers using the dynamic one aren’t as good as the ones using the static one, so a good mix would be needed. It’s all up for grabs, as usual. Once the right mix is found (note that CryptoMS didn’t MIX them, it just used one or the other based on an (almost) static selector), making it work in a multi-threaded environment would probably be great, since you could have each thread use a different mix, i.e. 50%-50% for one thread and 10%-90% for another.
I hope this helped,
Mate
“Cleaning clauses is glue-based (i.e. static, not dynamic), so if a thread wants to delete a clause, the others would like to do so as well. You seem to be talking about an architecture where this doesn’t hold.”
Well, I started from your earlier comment:
“I actually wanted to have activities for each clause stored locally at each thread. […] As for learnt clause cleaning, these activities would have to be merged somehow, which may not be trivial at all.”
Anyway, there’s also a much better way to do this than I first thought. The atomic reference count updates are not needed at all (!!!)
Instead of atomically modifying each clause’s reference count when it is detached in a thread, the clause index can be pushed to a vector while doing the cleaning. Then, when cleaning is over, this vector can be sent to the thread owning those clauses using a single atomic compare-and-exchange. The recipient thread can then iterate over the vector, decrementing the reference counts non-atomically, and delete (the memory for) the clauses that are no longer needed.
It’s really a very simple idea: Batch updates to minimise communication between threads. Yet another example of the “lazy” paradigm?
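The batched handover could be as small as this (names invented); one atomic operation per cleaning round instead of one per detach:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// A single-slot mailbox per owning thread: a cleaning thread publishes
// its whole batch of detached clause indices with one compare-and-
// exchange; the owner drains it and decrements plain, non-atomic counts.
struct Mailbox {
    std::atomic<std::vector<uint32_t>*> batch{nullptr};   // nullptr = empty

    // A real version would handle a full mailbox, e.g. by keeping a
    // list of batches; one slot keeps the sketch minimal.
    bool trySend(std::vector<uint32_t>* b) {
        std::vector<uint32_t>* expected = nullptr;
        return batch.compare_exchange_strong(expected, b,
                                             std::memory_order_acq_rel);
    }

    std::vector<uint32_t>* take() {
        return batch.exchange(nullptr, std::memory_order_acq_rel);
    }
};
```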