# Faster cleaning of the learnt clause database

In SAT solvers, removing unneeded learnt clauses from the clause database sounds like a trivial task: we somehow determine which clauses are not needed and we call remove() on them. However, when performance is an issue, it can get a bit more complicated.

### The problem

The issue at hand is that clauses are stored in two places: as a list of pointers to the clauses and in a list of lists called the watchlist. Removing clauses from either list can be an $O(n^2)$ operation if we e.g. remove every element from the list, one by one. In fact, an old version of the most popular SAT solver, MiniSat2, used to do exactly this:

Here, removeClause() is called on each clause individually, where removeClause() eventually calls remove() twice, where remove() is a linear operation:

It is clear that if the number of learnt clauses removed is a significant percent of all clauses (which it is after some runtime), this is an $O(n^2)$ operation.
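To make the cost concrete, here is a simplified sketch of the one-by-one pattern. It is not MiniSat's actual code: `Clause`, `Watchlists`, `remove_from`, `remove_clause_slow` and `clean_one_by_one` are hypothetical names, and I assume each clause watches its first two literals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified types: a literal is just an index into the watchlists.
struct Clause {
    std::vector<int> lits;   // lits[0] and lits[1] are the watched literals
};
using Watchlists = std::vector<std::vector<Clause*>>;

// Linear-time removal of one pointer from one list.
void remove_from(std::vector<Clause*>& list, Clause* c) {
    auto it = std::find(list.begin(), list.end(), c);   // scans the list
    if (it != list.end()) list.erase(it);                // shifts the tail
}

// Detaching a single clause costs two linear scans, one per watched literal.
void remove_clause_slow(Watchlists& watches, Clause* c) {
    assert(c->lits.size() >= 2);
    remove_from(watches[c->lits[0]], c);
    remove_from(watches[c->lits[1]], c);
}

// Removing k clauses this way costs O(k * n) in the worst case,
// i.e. O(n^2) once k is a constant fraction of n.
std::size_t clean_one_by_one(Watchlists& watches, std::vector<Clause*>& learnts,
                             const std::vector<bool>& drop) {
    std::size_t kept = 0;
    for (std::size_t i = 0; i < learnts.size(); i++) {
        if (drop[i]) {
            remove_clause_slow(watches, learnts[i]);
            delete learnts[i];
        } else {
            learnts[kept++] = learnts[i];
        }
    }
    learnts.resize(kept);
    return kept;
}
```

The clause list itself is compacted in a single pass here; it is the repeated linear scans of the watchlists that add up.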

### My original solution

My original solution to this problem was the following. First, I did a sweep on the watchlist and detached all learnt clauses. This is an $O(n)$ operation. Then, I ran the algorithm above, without the removeClause(). Finally, I attached the remaining learnt clauses: again an $O(n)$ operation. This solution is significantly faster than the MiniSat one as its worst-case runtime is only $O(n)$. The improvement is measurable — worst-case cleaning times dropped from seconds to tenths of seconds. However, it can be further improved.

### The improved solution

The improvement that came to my mind just yesterday was the following. I can keep a one bit marker in each learnt clause that indicates whether the clause needs to be detached or not. Then, I can run the algorithm as above but replace removeClause() with markclause() and run through the watchlists once to remove (and free) the marked clauses. This works really well and it only necessitates one sweep of the watchlists, without any useless detach+reattach cycles.
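A minimal sketch of this mark-then-sweep idea could look like the following. The types and the names `mark_clause` and `sweep_marked` are made up for illustration and are not the CryptoMiniSat code; I again assume that each clause is watched through pointers in the watchlists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Clause {
    std::vector<int> lits;   // lits[0] and lits[1] are the watched literals
    bool marked = false;     // the one-bit "to be removed" marker
};
using Watchlists = std::vector<std::vector<Clause*>>;

// O(1): only flip the bit, the watchlists are not touched yet.
void mark_clause(Clause* c) {
    assert(c != nullptr);
    c->marked = true;
}

// One pass over all watchlists drops every pointer to a marked clause.
// Only afterwards are the marked clauses freed, so the marker is never
// read from freed memory.
void sweep_marked(Watchlists& watches, std::vector<Clause*>& learnts) {
    auto is_marked = [](const Clause* c) { return c->marked; };
    for (auto& ws : watches)
        ws.erase(std::remove_if(ws.begin(), ws.end(), is_marked), ws.end());

    std::size_t kept = 0;
    for (std::size_t i = 0; i < learnts.size(); i++) {
        if (learnts[i]->marked) delete learnts[i];
        else learnts[kept++] = learnts[i];
    }
    learnts.resize(kept);
}
```

Each watchlist is compacted in place exactly once, so the total work is linear in the size of the watchlists plus the number of learnt clauses.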

The newer GitHub version of MiniSat also marks the clauses instead of detaching them immediately and then removes them in one sweep, later. Interestingly, it keeps a list of ‘dirty’ occurrence lists and only goes through the ones that need removal. I find that a bit strange for this specific purpose: usually almost all watchlists are affected. In other cases, though, keeping dirty lists in mind can be a good idea, e.g. if only a few clauses are removed for some optimization step.

# TreeLook and transitive reduction

The paper by Heule et al. about hyper-binary resolution using intree-based lookahead is pretty funky. The idea is actually quite simple (and as usual, not exactly trivial to come up with): we re-use past propagations by reversing the order in which literals are normally enqueued.

### A simple example

First, a queue is built that starts with a leaf literal and then follows it up through binary clauses for as long as it can. Then it backtracks (adds a special * element to the queue) and continues. The point of the queue is to have an order from which we can dequeue literals in reverse propagation order. Obviously, there are many different orders in which we can build this queue and I wouldn’t be surprised if there are some nice heuristics one can use. Let’s just assume we have such a queue.

For example if y leads to x, then an example queue will have first element x and then y. So we first enqueue x, propagate, and then we enqueue y. If x already fails, there is no point in enqueuing y (and y is failed along with x). If both y and z lead to x but only z fails, then we don’t have to perform the propagations done by x twice: We enqueue x, propagate, create new decision level, enqueue y, propagate (nothing fails), backtrack 1 decision level, enqueue z, and now we fail. Notice that we didn’t have to propagate x twice even though we probed two literals (y and z) that both entailed x.

### Failing mid-way

The paper mentions failed literals that fail mid-way while dequeueing elements. We obviously cannot simply enqueue these literals, as they would be unset next time we backtrack. So these have to be kept in an array and set later, when we are at decision level 0. Further, once we are in a failed state, anything dequeued that is at the same or lower level also fails, so we need to keep an indicator of failure for these literals.
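The dequeue order and the failure bookkeeping described so far can be traced with a toy model. This is only a sketch under my own assumptions: the queue layout `x, y, *, z, *, *`, the names `ProbeResult` and `probe_in_queue_order`, and the `fails` callback (a stand-in for "propagating this literal leads to a conflict") are all hypothetical, and no real propagation takes place.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct ProbeResult {
    int propagations = 0;              // literals actually enqueued and propagated
    std::vector<std::string> failed;   // failed literals, to be set later at decision level 0
};

// Toy model: `queue` holds literal names and "*" (backtrack one level).
ProbeResult probe_in_queue_order(const std::vector<std::string>& queue,
                                 const std::function<bool(const std::string&)>& fails) {
    ProbeResult res;
    int depth = 0;         // current decision level
    int fail_depth = -1;   // level at which the conflict appeared, -1 if none

    for (const std::string& tok : queue) {
        if (tok == "*") {
            assert(depth > 0);
            if (depth == fail_depth) fail_depth = -1;   // leaving the failed subtree
            depth--;
            continue;
        }
        depth++;                                        // new decision level
        if (fail_depth != -1) {
            res.failed.push_back(tok);                  // entails a failed literal, so it fails too
            continue;
        }
        res.propagations++;                             // enqueue + propagate
        if (fails(tok)) {
            res.failed.push_back(tok);
            fail_depth = depth;
        }
    }
    return res;
}
```

With the queue above and only z failing, x is propagated once and shared by the probes of y and z. If x itself fails, y and z are recorded as failed without any further propagation.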

### Keeping reasons updated

Let’s suppose we enqueued x and propagated it. Next is y. We enqueue y… but we need to know the reason why x got set. The reason is of course the binary clause that we examined when we built the queue: (x, ~y). The reason needs to be set because we will be jumping backwards through the implication graph to the deepest common ancestor to attach the new hyper-binary clause there. When jumping back, we might need to go back all the way to y, through x. In order to perform transitive reduction (as explained later), we also need to know whether the binary clause (x, ~y) was redundant or irredundant. This information needs to be stored in the queue, and every time we dequeue a new literal y, the reason of the previously enqueued literal needs to be set to the inverse of the currently enqueued literal, i.e. ~y.

### Transitive reduction

Updating reasons becomes a real problem in case we wish to perform transitive reduction. Transitive reduction removes binary clauses that are useless from a binary implication graph reachability perspective. However, if it removes a binary clause that is later used by the queue to update a reason, we encounter a problem. We may update a literal with a reason that is no longer valid as the corresponding binary clause has been replaced by a chain of binary clauses. Later transitive reductions will take into account that this binary clause exists (it doesn’t) and will make further reductions that may be incorrect. In particular, further transitive reductions might remove an element of the chain itself — kind of like biting our own tail.

There seem to be a few options to fix the problem:

1. Not to perform transitive reduction at all. This may have been the intention of the designers, as the BCP_NHBR function does not perform transitive reduction.
2. Update the queue to reflect the changed set of binary clauses. Unfortunately this would be very expensive and thus basically not doable in a reasonable amount of time as far as I can tell.
3. Never remove binary clauses that are used for the queue. This means we need to mark such clauses and then check for markings when removing binary clauses. This is the implementation that I chose. We can immediately unmark a clause once the corresponding element has been dequeued, making it possible to remove it later. In CryptoMiniSat I simply unmark all binary clauses at the end — it’s faster.
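The third option can be sketched roughly as follows. The struct layout and the names `mark_for_queue`, `try_remove` and `unmark_all` are hypothetical and only meant to show the bookkeeping; they are not taken from CryptoMiniSat.

```cpp
#include <cassert>
#include <vector>

struct BinClause {
    int lit1, lit2;
    bool redundant;       // learnt (redundant) or original (irredundant)
    bool used_by_queue;   // set while a queue element relies on this clause as a reason
    bool removed;
};

// Called while building the queue, for every binary clause the queue follows.
void mark_for_queue(BinClause& c) {
    assert(!c.removed);
    c.used_by_queue = true;
}

// Transitive reduction asks before removing: marked clauses are protected.
bool try_remove(BinClause& c) {
    if (c.used_by_queue || c.removed) return false;
    c.removed = true;
    return true;
}

// Cheap bulk unmark once the whole queue has been processed.
void unmark_all(std::vector<BinClause>& bins) {
    for (BinClause& c : bins) c.used_by_queue = false;
}
```

Unmarking per dequeued element would free clauses for removal earlier, at the cost of one extra lookup per dequeue; the bulk version trades that precision for speed.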

### Conclusions

I remember some people always asking me why I haven’t yet implemented intree-based probing. It is much faster than normal probing. However, it’s not perfect. For example, it cannot be used to perform a fast depth-first walk of the tree and as such stamping is not really possible while doing it — always updating closing times for already dequeued elements seems to defeat the purpose of the whole idea (i.e. reversing the propagation order). Secondly, I haven’t yet found a way to efficiently perform Stalmarck while doing intree probing. Thirdly, it’s not exactly trivial to implement — as explained above.

# On using less memory for binary clauses in lingeling’s watchlists

Armin Biere gave a lecture at the Pragmatics of SAT workshop (proceedings here) in Vienna about all the things inside lingeling, which won a lot of awards[PDF] this year. If you weren’t there, you missed an amazing presentation. In this blog post I’ll reflect on a particular part of the presentation dealing with a memory trick that has been intriguing me for a long while but that I did not implement. Before I begin let me say: the presentation was awesome, and it’s not by chance that lingeling won so many awards.

### The idea

The idea used by lingeling I want to talk about is easy to explain (though not easy to invent, as is usually the case). If you look at typical CNF problems, the majority of the clauses will be binary, i.e. only contain 2 literals. These clauses used to be stored exactly the same way as normal clauses: in the heap we allocate 2 literals and we put a pointer into the watchlist to these literals.

An improvement over this idea is the so-called implicit clause. For the binary clause “x V y” we put into the watchlist of x the literal y, and in the watchlist of y the literal x. There is no other place we store these binary clauses, hence the word “implicit”. For other clauses, we still put pointers into the watchlist and allocate space, as usual. The problem with this approach is that the pointer to a clause is 32b (we use a 32b offset on 64b machines) but for each clause we also store a so-called blocking literal in the watchlist, which is also 32b. That makes the entries in the watchlist 64b long for normal clauses, and 32b long for binary clauses.

The idea is to have differing sizes of elements in a watchlist. If e.g. the first bit of the element is a 1, the next 63b relate to a long clause, and if it’s a 0, then the next 31b relate to a binary clause. In case 80% of the clauses are binary, this saves 50% of the space in 80% of the cases. Not bad at all.
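To get a feel for such a layout, here is a small sketch. It is not lingeling's actual encoding: `PackedWatchlist`, `LONG_TAG` and `count_watches` are names I made up, and I assume the tag is the top bit of the first 32b word, followed by a 31b literal and, for long clauses, a 32b clause offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t LONG_TAG = 1u << 31;   // hypothetical tag bit: 1 = long clause

struct PackedWatchlist {
    std::vector<std::uint32_t> words;

    // Binary clause: one 32b word, tag bit 0, then the other literal (31b).
    void push_binary(std::uint32_t other_lit) {
        assert((other_lit & LONG_TAG) == 0);
        words.push_back(other_lit);
    }
    // Long clause: two 32b words, tag bit 1 + blocking literal, then the clause offset.
    void push_long(std::uint32_t blocking_lit, std::uint32_t clause_offset) {
        assert((blocking_lit & LONG_TAG) == 0);
        words.push_back(blocking_lit | LONG_TAG);
        words.push_back(clause_offset);
    }
};

struct WatchCounts {
    std::size_t binary = 0;
    std::size_t longs = 0;
};

// Walking the list means inspecting the tag to know how far to advance.
WatchCounts count_watches(const PackedWatchlist& wl) {
    WatchCounts n;
    for (std::size_t i = 0; i < wl.words.size();) {
        if (wl.words[i] & LONG_TAG) { n.longs++; i += 2; }
        else                        { n.binary++; i += 1; }
    }
    return n;
}
```

With 8 binary and 2 long entries this takes 8·4 + 2·8 = 48 bytes instead of 10·8 = 80 bytes, which is the 0.8 × 0.5 = 40% overall saving implied by the numbers above.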

The advantages of using this idea are twofold. First, as already mentioned, memory use is lowered. This is non-trivial as memory usage of the watchlists can be enormous and although many other improvements can be switched off (such as e.g. stamping), storage of the clauses can never be switched off. Secondly, not having holes in memory leads to much better cache usage which in turn can bring real speedup. In case you think this is not important, you might enjoy knowing that the HHVM module of Facebook was made over 2x faster by making sure that important cache lines are not knocked out[PDF].

In case you have an array that has varying size elements in it, some non-trivial complications arise. Let me list just a few that come to my mind.

**Sorting the list is no longer trivial.** You cannot just swap elements: they might not fit. One way to do sorting efficiently is to move all the data into another, equally-spaced container and sort there, then move it back in. However, keep in mind that the reason why quicksort is so fast is that it can do in-place sort. Merge-sorting would be another option, but it copies elements and it’s not by chance it’s not the default sort in most cases. Also, you would have to re-write merge-sort of course, to deal with the varying-sized elements.

In case you think that sorting is not needed, maybe you forgot to consider the lightning-quick subsumption you can do between implicit binary and tertiary clauses using sort, to give just one example.

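**Removing an element is no longer an O(1) operation.** With equally-sized entries, when you need to remove element X from a watchlist, you can simply swap the last element into the position of X and make the array one smaller. This trick is used quite extensively since the order of the watchlist is usually irrelevant. With varying-sized entries, the last element might not fit into the slot of X, so the trick no longer works in general.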
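For reference, this is the constant-time trick that equally-sized entries allow. It is a generic sketch (`swap_remove` is my own name for it) and relies on every slot having the same size, which is exactly what the packed layout gives up.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// O(1) removal when the order of the elements does not matter:
// overwrite the removed slot with the last element and shrink by one.
template <typename T>
void swap_remove(std::vector<T>& v, std::size_t pos) {
    assert(pos < v.size());
    if (pos + 1 != v.size()) v[pos] = std::move(v.back());
    v.pop_back();
}
```

In a packed watchlist, a 64b long-clause entry at the end cannot be moved into the 32b slot of a removed binary entry, so removal has to shift or compact the tail instead.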

**Loops need to be re-written.** All your loops that go through the watchlist need to be re-written and have to have a switch() in them with some pointer arithmetic (do we advance by 32b or 64b?). In case you think you don’t need to go through the watchlist too often anyway, think again. Any time you need to do anything with clauses you will have to go through the watchlists, since they are the only place where binary (and tertiary, if your system is optimised enough) clauses are stored. This means your watchlist-gothrough function will be absolutely everywhere in the code unless you don’t want to implement any pre- or in-processing.

You might think you could write a function and just pass a pointer to another function to it that does the ‘real’ job, essentially hiding the complexity in a function that you only need to write once. There are three problems with this. First, this will be a tight loop and so performance is important, which you will lose as you will need to dereference the passed-in function pointer every time. This can be overcome with the use of templates but it won’t make the code pretty. Secondly, your original, hiding function will need to be written more than a single time. For example, some such executions will need to count time (operations) and some won’t. You will need to count pointer dereferences (normal clause is fetched) and binary clauses (no pointer dereferenced) in a significantly different way as a cache-miss is very expensive and a clause-access will cause such a cache-miss most of the time. For performance reasons you will need variations that don’t dereference long clauses, variations that allow for manipulation of the array, variations that don’t, etc.
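As a rough idea of what the template route could look like, here is a sketch of a single read-only traversal function. The tag-bit layout and the names `LONG_TAG` and `for_each_watch` are assumptions of mine, not lingeling's code, and this covers only one of the many variations mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t LONG_TAG = 1u << 31;   // hypothetical tag bit: 1 = long clause

// The visitors are template parameters, so the compiler can inline the calls
// instead of going through a function pointer on every entry.
template <typename OnBinary, typename OnLong>
void for_each_watch(const std::vector<std::uint32_t>& words,
                    OnBinary on_binary, OnLong on_long) {
    for (std::size_t i = 0; i < words.size();) {
        if (words[i] & LONG_TAG) {
            assert(i + 1 < words.size());
            on_long(words[i] & ~LONG_TAG, words[i + 1]);   // blocking literal, clause offset
            i += 2;
        } else {
            on_binary(words[i]);                           // the other literal
            i += 1;
        }
    }
}
```

A caller passes two lambdas, one per entry kind. A variant that counts operations, or one that lets the visitor delete entries, would need its own traversal function, which is the duplication problem described above.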

**Maintaining datastructure consistency becomes harder.** Unless you use hiding functions, which is non-trivial as explained above (and maybe impossible in e.g. plain C), the complexity to maintain consistency of the datastructure will be all over the code. Even if done very carefully, the constraints on the datastructure may end up being implicitly, rather than explicitly, represented in the code. It will make it easier to create bugs and harder to find them.

### Conclusions

This idea of using less memory for binary clauses in the watchlists is very interesting and has intrigued me for a long while — Armin was kind enough to tell me about this a long time ago. It has the potential to save a lot of memory and to keep things more packed in the datastructure that is arguably the most accessed during solving and inprocessing. However, I was always daunted by the obstacles I saw in front of me — though I might simply need to understand C++11 and templates better to make it work.

Currently, I feel like there are plenty of other optimisations that I could implement from the talk of Armin, e.g. that all watchlists are stored in the same array, using offsets and a hand-rolled memory manager. That seems to have a potential of also improving the memory usage and speed while being easier to implement and easier to hide in a class.

# CryptoMiniSat 4.2 released

CryptoMiniSat 4.2 has been released. This release brings multi-threading, some bug fixes, and a lot of code cleanup.

Multi-threading has been implemented using the std::thread class of C++11. This makes it very portable and at the same time easy to use. The multi-threading is deliberately simple: threads only share unit and binary learnt clauses. Other approaches use some form of complex clause-sharing algorithm, sometimes even sharing clause databases. The upside of the simple design is that it also works when the solver is used as a library, even with assumptions. Simply call set_num_threads(N) before calling any other functions.

The method used to speed up the system is the portfolio method, i.e. there are many threads started with different parameters and they share some information among them. The threads are configured as:

These configurations have been chosen because they seemed to have quite orthogonal parameters. Only 12 threads are properly configured, the rest are not really configured and are only cleaning a lot more clauses than normal (so as not to run out of memory). In a certain sense, the above is the “secret sauce” that makes the parallel system work.

### Code cleanup

The code has been greatly refactored. This is an ongoing effort, but its fruits are already quite visible. In general, variable and function names are more meaningful, function sizes have been drastically cut and the expressiveness of the code has been improved.

Unfortunately, C++ (and C) are quite limiting in a number of ways, and so CryptoMiniSat might move to other languages such as Go. Go for example provides reflection and significantly improved compile times. These two are very useful for development: the former greatly simplifies testing while the latter allows for quicker build (and thus debug) cycles.

# SMT Competition’14 and STP

The 2014 SMT competition's QF_BV division is over, and the results are (copied from here):

| Solver | Errors | Solved | Not Solved | CPU time (solved instances) |
|---|---|---|---|---|
| Boolector | 0 | 2361 | 127 | 138077.59 |
| STP-CryptoMiniSat4 | 0 | 2283 | 205 | 190660.82 |
| [MathSAT] | 0 | 2199 | 289 | 262349.39 |
| [Z3] | 0 | 2180 | 308 | 214087.66 |
| CVC4 | 0 | 2166 | 322 | 87954.62 |
| 4Simp | 0 | 2121 | 367 | 187966.86 |
| SONOLAR | 0 | 2026 | 462 | 174134.49 |
| Yices2 | 0 | 1770 | 718 | 159991.55 |
| abziz_min_features | 9 | 2155 | 324 | 134385.22 |
| abziz_all_features | 9 | 2093 | 386 | 122540.04 |

First, let me congratulate Boolector by Armin Biere. STP came a not-so-close second. STP was meant to be submitted twice, once with MiniSat and once with CryptoMiniSat4; this is the only reason why the STP submission is called STP-CryptoMiniSat4. Unfortunately, the other submission could not be made because of some compilation problems.