Tag Archives: Optimisation

On using less memory for binary clauses in lingeling’s watchlists

Armin Biere gave a lecture at the Pragmatics of SAT workshop (proceeedings here) in Vienna about all the things inside lingeling which won a lot of awards[PDF] this year. If you weren’t there, you missed an amazing presentation. In this blog post I’ll reflect on a particular part of the presentation dealing with a memory trickery that has been intriguing me for a long while but I did not implement. Before I begin let me say: the presentation was awesome, and it’s not by chance that lingeling won so many awards.

The idea

The idea used by lingeling I want to talk about is easy to explain (though not easy to invent, as is usually the case). If you look at typical CNF problems, the majority of the clauses will be binary, i.e. only contain 2 literals. These clauses used to be stored exactly the same way as normal clauses: in the heap we allocate 2 literals and we put a pointer into the watchlist to these literals.

An improvement over this idea are the so-called implicit clauses. For the binary clause “x V y” we put into the watchlist of x the literal y, and in the watchlist of y the literal x. There is no other place we store these binary clauses, hence the word “implicit”. For other clauses, we still put pointers into the watchlist and allocate space, as usual. The problem with this approach is that the pointer to a clause is 32b (we use an 32b offset on 64b machines) but for each clause we also store a so-called blocking literal in the watchlist, which is also 32b. That makes the entries in the watchlist 64b long for normal clauses, and 32b long for binary clauses.

The idea is to have differing sizes of elements in a watchlist. If e.g. the first bit of the element is a 1, the next 63b relate to a long clause, and if it’s a 0, then the next 31b relate to a binary clause. In case 80% of the clauses are binary, this saves 50% of the space in 80% of the cases. Not bad at all.

The advantages

The advantages of using this idea are twofold. First, as already mentioned, memory use is lowered. This is non-trivial as memory usage of the watchlists can be enormous and although many other improvements can be switched off (such as e.g. stamping), storage of the clauses can never be switched off. Secondly, not having holes in memory leads to much better cache usage which in turn can bring real speedup. In case you think this is not important, you might enjoy knowing that the HHVM module of Facebook was made over 2x faster by making sure that important cache lines are not knocked out[PDF].

The disadvantages

In case you have an array that has varying size elements in it, some non-trivial complications arise. Let me list just a few that come to my mind.

Sorting the list is no longer trivial You cannot just swap elements: they might not fit. One way to do sorting efficiently is to move all the data into another, equally-spaced container and sort there, then move it back in. However, keep in mind that the reason why quicksort is so fast is that it can do in-place sort. Merge-sorting would be another option, but it copies elements and it’s not by chance it’s not the default sort in most cases. Also, you would have to re-write merge-sort of course, to deal with the varying-sized elements.

In case you think that sorting is not needed, maybe you forgot to consider the lightning-quick subsumption you can do between implicit binary&tertiary clauses using sort to give just one example.

Removing an element is no longer an O(1) operation In case you need to remove element X in a watchlist, you can simply swap the last element to the position of X and make the array one smaller. This trick is used quite extensively since the order of the watchlist is usually irrelevant.

Loops need to be re-written All your loops that go through the watchlist need to be re-written and have to have a switch() in them with some pointer arithmetic (do we advance by 32b or 64b?). In case you think you don’t need to go through the watchlist too often anyway, think again. Any time you need to do anything with clauses you will have to go through the watchlists, since they are the only place where binary (and tertiary, if your system is optimised enough) clauses are stored. This means your watchlist-gothrough function will be absolutely everywhere in the code unless you don’t want to implement any pre- or in-processing.

You might think you could write a function and just pass a pointer to another function to it that does the ‘real’ job, essentially hiding the complexity in a function that you only need to write once. There are three problems with this. First, this will be a tight loop and so performance is important, which you will loose as you will need to dereference the passed-in function pointer every time. This can be overcome with the use of templates but it won’t make the code pretty. Secondly, your original, hiding function will need to be written more than a single time. For example, some such executions will need to count time (operations) and some won’t. You will need to count pointer dereferences (normal clause is fetched) and binary clauses (no pointer dereferenced) in a significantly different way as a cache-miss is very expensive and a clause-access will cause such a cache-miss most of the time. For performance reasons you will need variations that don’t dereference long clauses, variations that allow for manipulation of the array, variations that don’t, etc.

Maintaining datastructure consistency becomes harder Unless you use hiding functions, which is non-trivial as explained above (and maybe impossible in e.g. plain C), the complexity to maintain consistency of the datastructure will be all over the code. Even if done very carefully, the constraints on the datastructure may end up being implicitly, rather than explicitly, represented in the code. It will make it easier to create bugs and harder to find them.

Conclusions

This idea of using less memory for binary clauses in the watchlists is very interesting and has intrigued me for a long while — Armin was kind enough to tell me about this a long time ago. It has the potential to save a lot of memory and to keep things more packed in the datastructure that is arguably the most accessed during solving and inprocessing. However, I was always daunted by the obstacles I saw in front of me — though I might simply need to understand C++11 and templates better to make it work.

Currently, I feel like there are plenty of other optimisations that I could implement from the talk of Armin, e.g. that all watchlists are stored in the same array, using offsets and a hand-rolled memory manager. That seems to have a potential of also improving the memory usage and speed while being easier to implement and easier to hide in a class.

Why it’s hard to eliminate variables

Let’s examine why it’s hard to eliminate variables. I remember the code I looked at in SatElite that did it: it was crazy clean code and looked like it was pretty easy to perform. In this post I’ll examine how that simple code became more than a 1’000 lines of code today.

What needs to be done, at first sight

At first sight, variable elimination is easy. We just:

  1. Build occurrence lists
  2. Pick a variable to eliminate
  3. Resolve every clause having the positive literal of the variable with negative ones.
  4. Add newly resolved clauses into the system
  5. Remove original clauses.
  6. Goto 2.

These are all pretty simple steps at first sight, and one can imagine that implementing them is maybe 50-100 lines of code, no more. So, let’s examine them one-by-one to see how they get complicated.

Building occurrence lists

The idea is that we simply take every single clause, and for every literal they have, we insert a pointer to the clause into an array for that literal’s occurrences. This sounds easy, but what happens if we are given 1M clauses, each with 1000 literals on average? If you think this is crazy, it isn’t, and does in fact happen.

One option is we estimate the amount of memory we would use and abort early because we don’t want to run out of memory. So, first we check the potential size, then we link them in. Unfortunately, this means we can’t do variable elimination at all. Another possibility is that we link in clauses only partially. For example, we don’t link in clauses that are redundant but too long. Redundant clauses are ignored during resolution when eliminating, so this is OK, but then we will have to clean these clauses up later, when finishing up. However, if a redundant clause that hasn’t been linked in backward-subsumes an irredudant clause (and thus becomes irredundant itself), we have to link it in asap. Optimisation leads to complexity.

We don’t just want to link these clauses in to some random datastructure. I believe it was Armin Biere who put this idea into my head, or maybe someone else, but re-using watchlists for occurrence lists means we use our memory resources better: there won’t be so much fragmentation. Furthermore, an advanced SAT solver uses implicit binary & tertiary clauses, so those are linked in already into the watchlists. That saves memory.

Picking a variable to eliminate

The order in which you eliminate clauses is a defining part of the speed we get with the final solver. It is crucially important that this is done well. So, what can we do? We can either use some heuristic or precisely calculate the gain for each variable, and eliminate the best guess/calculated one first. These are both greedy algorithms but I think given the complexity of the task, they are the best at hand.

Using precise calculation is easy, we just resolve all the relevant clauses but don’t add the resolvents. It’s very expensive though. A better approach is to use a heuristic. Logically, clauses that have few literals in them are likely not to resolve such that they become tautologies. It’s unlikely that two binary clauses’ resolvent becomes a tautology. It’s however likely that large clauses become tautological once resolved. I take this into account when calculating elimination cost for variable. Since redundant clauses are linked in the occurrence lists so that I can subsume them, I have to skip them.

It’s not enough to calculate the heuristic once, of course. We have to re-calculate after every elimination — the playing field has changed. Thus, for every clause you removed, you have to keep in mind which variables were affected, and re-calculate the cost for each after every variable elimination.

Resolving clauses

The base is easy. We add literals to a new array of literls and mark the literals that have been added in a quick-lookup array. If the opposite of a literal is added, the markings tell us and we can skip the rest — the resolvent is tautological. Things get hairy if the clause is not tautological.

What if the new clause is subsumed by already-existing clauses? Should we check for this? This is called forward-subsumption, and it’s really expensive. Backward subsumption (which asks the question ‘Does this clause subsume others?’ instead of ‘Is this clause subsumed by others?’) would be cheaper, but that’s not the case here. We can thus try to subsume the clause only by e.g. binary&tertiary clauses and hope for the best.

What if the new clause can be subsumed by stamps? That’s easy to check for, but if the new clause was used to create the stamp, that would be a self-dependency loop and not adding the resolvent would lead to an incorrect result. We can use the stamps as long as the resolving clauses were not needed for the stamp: i.e. they are not binary clauses and on-the-fly hyper-binary resolution was used during every step of stamp generation. A similar logic goes for using the implication cache.

We could also virtually extend the clause with literals using watchlists/stamps/impl. cache and then try to subsume that virtual clause. I forgot what 3-letter acronym Biere et al. gave to this method (it’s one of the 12 on slide 25 here), but, except for the acronym, this idea is pretty simple. You take a binary clause, e.g. xV~y, and if x is in the newly created clause, but y and ~y is not, you add y to the clause. The clause is now bigger, so has a larger chance to be subsumed. You now perform forward subsumption as above, but with the extended clause. Also, take care not to subsume clauses with themselves, which, as you might imagine, can get hairy.

If all of this sounds a bit intricate, this is not even the difficult part. The difficult part is keeping track of time. Where of course by time I don’t actually mean seconds — I mean computation steps that you have to define one way or another and increment counters and set limits. Remember: all this has to be deterministic.

Doing all of the above with a small but complicated instance is super-fast, under 0.001s. With a weird instance where one single literal may occur in more than a million clauses, it can be very-very expensive even for one single try — over 100s. That’s about 5 orders of magnitude of difference. So, you have to be careful. The resolution we cannot skip, but we can abort it (and indicate it up in the call tree). Some of the others we can abort, but then the whole resolution has to be re-started. Some of the above is not critical at all, so you have to use a different time-limit for some, and mark them as too expensive, so at least the basic things get done. This gets complicated, because e.g. forward-subsumption you might want to re-use at other parts of the solver so you have to use a time-limit that isn’t global.

Adding the newly resolved clauses

Adding clauses is simple: we create and link them in. However, we can do more. Since backward-subsumption is fast, we can do that with the newly created clauses. Note that this means the newly created clause could subsume some of the original clauses it was created from — which means the resolvents should be pre-generated and kept in memory.

Another thing: since we know the new clause needs to be added, we might as well shorten it before in any way we can. At this point, we can make use of all the watchlists, stamps and implication cache we have to shorten the new clause: there are no problems with self-dependencies. It will pay off. However, note that shortening the clause before adding it means that we will have to reverse-shorten it later, when this clause might be part of a group of clauses that is touched by a new variable elimination round. So, we are working against ourselves in a way — especially because reverse shortening is pretty expensive and hairy as explained above.

Although this is obvious, but we still have to take care of time-outs. For example, if resolution took so much time that we are already out of time, we must exit asap and not worry about the resolvents. Don’t link, don’t remove, just exit. Time is of essence.

Removing the original clauses

Easy, just unlink them from the occurrence lists. I mean, easy if you don’t care about time, of course. Because unlinking is an O(N^2) operation if you have N clauses and all of them contain the same literal X — the N-long occurrence list of literal X has to be read and updated N times. So, we don’t do this.

First of all, a special case: the two occurrence lists of the variable we are removing can simply be .clear()-ed. It’s no longer needed. Secondly, we shouldn’t unlink clauses one-by-one. Instead, we should mark the clause as removed, and then not care about the clause later. Once variable elimination is finished, we do a sweep of all the occurrence lists and clauses and remove the clauses that have been marked. This means that e.g. forward and backward subsumption gets more hairy (we shouldn’t subsume with a clause that’s been marked as removed but is still in the occurrence list) but that O(N^2) becomes O(N) which for problems where N is large makes quite a bit of difference. Like, the difference of 100s vs. 10s for a the same exact thing.

The untold horrors

On top of what’s above, you might like to generate some statistics about what worked and what didn’t. You might like to dump these statistics to a database. You might like to not create resolutions that are not needed as the irreduntant clauses form an AND/ITE gate. Or multiple gates. You might like to eliminate only a subset of variables at each call so that you don’t make your system too sparse and thus reduce arc consistency. You might want to vary this limit based on the problem at hand. You might want to do many other things that are not detailed above.

Conclusions

Once I read through the above, I realized I kind of missed the essence: time-outs. It’s mentioned here and there, but it’s much more critical than it seems and makes things a hell of a lot harder. How do you cleanly exit from the middle of reverse-shortening while resolving because you ran out of time? I could just bury my head in sand of course and say: I don’t care. Or, I could make some messy algorithm that checks return values of each call and return a special value in case of time-outs. This needs to be done for every level of the call, which can be pretty deep, unless you like writing 1’500 line functions. I wanted to say writing&reading, but, really, nobody reads 1’500 line functions. They are throw-away,write-only code.

CryptoMinisat 3.1 released

CryptoMinisat 3.1 has been released. The short changelog is:

The changes made were threefold. First, memory usage has been greatly reduced. This is crucial, because memory usage was over 7GB on certain instances. Secondly, the implication cache wasn’t very well-used and an idea that came to my mind greatly improved performance on most problems. Finally, time limiting of some inprocessing techniques on certain types of problems has been improved.

Memory usage reduction

On instances that produced a lot of long learnt clauses the memory usage was very high. These learnt clauses were all automatically linked in to the occurrence list and consequently took large amounts of memory, sometimes up to 10GB. On other instances, the original clauses were too numerous and too large, so putting even them into the occurrence list was too much. On these instances, variable elimination is not carried out (or carried out only later, when enough original clauses have been removed/shortened). To debug some of these problems, I wrote a fuzzer that generates extremely large problems with many binary and many long clauses, it’s available here as “largefuzzer”. It’s actually quite nice with many-many binary clauses so it also can fuzz the problems encountered with probing of extremely weird and large instances.

Implied literal usage improvement

CryptoMiniSat uses implied literals, i.e. caches what literals were propagated by each literal during probing. It then re-uses this information to subsume and/or strengthen clauses. This is kind of similar to stamping though uses more memory. It is actually useful to have alongside stamping, and I now do both — propagating DFS that stamping requires is expensive though updating cache during DFS is just as easy as during quasi-BFS.

The trick I discovered while playing around with cached implied literals is that if literal L1 propagates L2 and also !L2 then that means there are conceptually two binary clauses in the solver (!L1, L2), (!L1, !L2), so !L1 is TRUE. This is of course trivial, but I never checked for this. The question most would raise is: why would L1 propagate both L2 and !L2 and not fail? The answer is kind of tricky, but very interesting. Let’s say at one point, L1 propagates L2 due to a learnt clause, but that learnt clause is then removed. A new learnt clause is then later learnt, and with that learnt clause in place, L1 propagates !L2. Now, without caching, this would be ignored. Caching memorizes past conceptual binary clauses and re-uses this information.

This is not an optimization that only looks good on paper, it is very good to have. With this one optimization, I gained 5 instances from the SAT Comp’09 instances with a 1000s timeout (196 solved -> 201 solved). I can’t right now imagine how this could be done with stamping effectively, but that doesn’t mean it’s not possible. Though, according to my experience, stamping doesn’t preserve that much information over time as it’s being updated (renumbered) frequently while the cache is only improved over time, never shrunk. A possibility would be to have more than one stamp system and round-robin selecting them. However that would mean that sorting of clauses (for shrinking) would need to be done more than once, and sorting them is already relatively expensive. I sometimes feel that what stamping gains in memory it looses on sorting (i.e. processing time) and lower coverage (re-numbering).

More precise time-limiting

Martin Maurer has been kind enough to file a lot of bug reports about probing and variable elimination taking too much time, sometimes upwards of 150s when they should take around 20-30s maximum. While investigating, it tuned out that the problem was very weird indeed. While trying to eliminate or probe one variable the time for that one variable took upwards of 100s. This was completely unexpected as the code only checked for timeouts on a per-variable basis. In the end, the code had to be improved to track time on an intra-variable basis in both systems. While at it, I also added intra-variable time-tracking to implicit clause subsumption and strengthening too. So, over-times should less prevalent from now on. As an interesting side-note, time-limiting on probing is now so fine-grained that a 32-bit unsigned integer would overflow within 15s if used as the time-tracker.

Implicit binary clauses

I have lately been trying to get CryptoMiniSat to use implicit binary clauses. The idea is that since binary clauses are very trivial (just two literals), and they don’t really need to keep state (like clause activity), they don’t really need to be stored at a separate location. Instead, they can be stored directly in the watchlists. There are a number of advantages and disadvantages that come with this approach.

The main advantage is a notable reduction of memory usage and memory fragmentation. The first is obvious: since we don’t allocate space for binary clauses separately, the memory usage of the program should go down. This is especially true since SAT problems usually contain a huge number of binary clauses. The secondary benefit, that of reduced memory fragmentation is not really that much of an advantage if someone uses, e.g. the boost pool library.

The disadvantages are mainly twofold. Firstly, bugs are very difficult to find. Since there is not one central database of binary clauses (as before), it becomes difficult to check the consistency of the watchlists. However, if inconsistencies creep in, then the solution found by the SAT solver could be wrong. Worst of all, consistency is difficult to keep, as binary clauses often need to be worked on by e.g. subsumption, variable elimination, etc. The second biggest disadvatage is that if a new algorithm comes up that needs a database of binary clauses, this database would need to be re-built every time to run that algorithm, which can be very costly in terms of time.

All in all, it took me about 1 day to implement implicit binary clauses, and about 3 days to debug it. Surprisingly, it came with some very positive side-effects. Firstly, the debugging session took out some very long-standing bugs in CryptoMiniSat. Secondly, since binary clauses represent most clauses in typical SAT problems, and since binary clauses cannot be subsumed by anything other than binary clauses, the subsumption algorithm has been notably speeded up and its memory usage has been reduced. The worst part of using implicit binary clauses has been the fact that I can no longer use binary clause-sorting to find binary xor clauses, and I must resort back to Tarjan’s strongly connected component finding algorithm. This algorithm is algorithmically faster (only O(V+E) versus O(n*logn) ), but practically slower, since I need to carry it out repeatedly, and I cannot save state. Furthermore, I haven’t yet coded it down, so I am using boost’s graph algorithm, which means that I now have a dependency in CryptoMiniSat. This will eventually be corrected, but it is already causing some trouble.

In the meantime, I have been parallelising CryptoMiniSat. Funnily enough, it took me about 2 days in total to multi-thread CryptoMiniSat versus the 4 days it took to implement implicit binary clauses. Oh well.

Asymmetric branching

Asymmetric branching is an algorithm that shortens CNF clauses in SAT Solvers. A clause, for instance v1 V v2 V v3 V v4 (where letters are binary variables and V represents binary OR) could possibly be shortened to v1 V v2 V v3. To find out if it can be, all we have to do is to put -v1,-v2 and -v3 into the propagation queue and then propagate. If we receive a conflict from the propagation engine, we can learn the clause v1 V v2 V v3, which (incidentally, though this was the point), subsumes the original clause, so we can simply remove variable v4 from the original clause.

Ok, so much for theory. Now comes the hard part: how do we do this such that it actually speeds up the solving? The problem is that asymmetric branching, when done on all possible clauses, is slow. However, its benefits could be large, since a shortened clause naturally leads a faster propagation and shorter resolution proofs thus less propagation need.

I have been experimenting in getting some benefit from asymmetric branching, and now it works extremely well for CryptoMiniSat. The trick I use, is that I first sort the clauses according to size, and only try to shorten with asymmetric branching the top couple of clauses. This ensures that the largest clauses are shortened first. Since the largest clauses contribute most the to size of the resolution proof and they are the slowest to propagate, this makes sense.

CryptoMiniSat tries to do asymmetric branching regularly, always for only a little while (~2 seconds). I believe this is useful, because as the amount of time the program has been trying to solve a problem increases, it makes sense that we have a bit more time to do things that could help resolve the problem faster. For instance, if CryptoMiniSat has been trying to solve a given problem for 30 minutes unsuccessfully with the standard clause-learning DPLL procedure in vain, we can allocate 2-3 seconds to possibly gain ~5-10% later. In the example case it can be assumed that since we haven’t been able to solve it for 30 minutes, probably we won’t solve it in the next 10 minutes, so gaining 5% on 10 minutes is 30 seconds, far more than the 2-3 seconds we invested.

The results with asymmetric branching with CryptoMiniSat are quite astounding. Using the 2009 SAT Competition benchmark set with an approximately correct timeout, CryptoMiniSat could normally solve 217 problems. With asymmetric branching, CryptoMiniSat can now solve 220 — a huge increase: last year, 16 solvers running in parallel could only solve 229 instances in total.

Edited to add (26/09/2010): Clause Vivification by Piette, Hamadi and Sais is a paper about the above described method, though with a number of key differences. The paper seems to advocate for a complete procedure, furthermore, it calls the conflict generation routine in certain cases. I believe that the above described way of carrying out this technique brings more tangible benefits, especially for larger problems.

(Updated: we need to branch on the inverse of the clause’s literals. Thanks to Vegard Nossum for spotting this)